Immediate evaluation of CTE - hadoop

I am trying to optimize a very long and complex impala query which contains multiple CTE. Each CTE is used multiple times. My expectation is that once a CTE is created, I should be able to direct impala that results of this CTE should be re-used in main query as-is instead of SCAN HDFS operation on tables involved in CTE again with the main query. Is this possible? if yes how ?
I am using impalad version 2.1.1-cdh5 RELEASE (build 7901877736e29716147c4804b0841afc4ebc9037) version

I do not think so. I believe WITH clause does not create any permanent object, it's just for you avoid cluttering the namespace with new tables or views and to make it easier to refactor large, complex queries by reor‐
dering and replacing their individual parts. The queries used in the WITH clause are good candidates to someday become views or to be materialized as summary tables during the ETL process.

Is this possible?
The very purpose of the CTE is to re-use of the results obtained from a preceding query (that uses the with clause) by a following query, say, SELECT. So i don't see a reason why it is not possible.
Use Explain on your query to find out the actual SCAN HDFS details.
For more I/O related insights use profile as in the official documentation https://www.cloudera.com/documentation/enterprise/5-7-x/topics/impala_explain_plan.html#perf_profile

Related

Oracle not uses best dbplan

i'm struggeling with Performance in oracle. Situation is: Subsystem B has a dblink to master DB A. on System B a query completes after 15 seconds over dblink, db plan uses appropriate indexes.
If same query should fill a table in a stored procedure now, Oracle uses another plan with full scans. whatever i try (hints), i can't get rid of these full scans. that's horrible.
What can i do?
The Oracle Query Optimizer tries 2000 different possibilities and chooses the best one in normal situations. But if you think it choose wrong plan, You may suspect the following cases:
1- Your histograms which belongs to querying tables are deprecated.
2- Your indexes can not be used because of your faulty query.
3- You can use index hints to force the indexes to be used.
4- You can use SQL Advisor or run TKProf for performance analysis and decide what's wrong or what caused bad performance. Check network, Disk I/O values etc.
If you share your query we can give you more information.
Look like we are not taking same queries in two different conditions.
First case is Simple select over dblink & Second case is "insert as select over dblink".
can you please share two queries & execution plans here as You may have them handy. If its not possible to past queries due to security limitations, please past execution plans.
-Abhi
after many tries, I could create a new DB Plan with Enterprise Manager. now it's running perfect.

Can full information about an Oracle schema data-model be selected atomically?

I'm instantiating a client-side representation of an Oracle Schema data-model in custom Table/Column/Constraint/Index data structures, in C/C++ using OCI. For this, I'm selecting from:
all_tables
all_tab_comments
all_col_comments
all_cons_columns
all_constraints
etc...
And then I'm using OCI to describe all tables, for precise information about column types. This is working, but our CI testing farm is often failing inside this schema data-model introspection code, because another test is running in parallel and creating/deleting tables in the middle of this serie of queries and describe calls I'm making.
My question is thus how can I introspect this schema atomically such that another session does not concurrently change that very schema I'm instropecting?
Would using a Read-only Serializable transaction around the selects and describes be enough? I.e. does MVCC apply to Oracle's data dictionaries? What would be the likelihood of SnapShot too Old errors on such system dictionaries?
If full atomicity is not possible, are there steps I could take to minimize the possibility of getting inconsistent / stale info?
I was thinking maybe left-joins to reduce the number of queries, and/or replacing the OCIDescribeAny() calls with other dictionary accesses joined to other tables, to get all table/column info in a single query each?
I'd appreciate some expert input on this concurrency issue. Thanks, --DD
a typical read-write conflict. from the top of my head i see 2 ways around it:
use dbms_lock package in both "introspection" and "another test".
rewrite your retrospection query so that it returns one big thing of what you need. there are multiple ways to do that:
use xmlagg and alike.
use listagg and get one big string or clob.
just use a bunch of unions to get one resultset, as it's guaranteed to be consistent.
hope that helps.

Can we boost the performance of COUNT, DISTINCT and LIKE queries?

As far as I understand, when we run SQL query with COUNT, DISTINCT or LIKE %query% (wildcards at both sides) keywords the indexes cannot be used and the database have to do the full table scan.
Is there some way to boost the performance of these queries?
Do they really cannot use indexes or we can fix this somehow?
Can we make an index-only scan if we need to return only one column? For example: select count(id) from MY_TABLE: probably in this case we can make index-only scan and avoid hitting the whole table if we have index on 'id'?
My question has a general meaning: could you give me some performance guidelines if we have to use the mentioned operators?
UPDATE
As for me I use PostgreSQL.
with PostgreSQL, you can create GIN pg_trgm indexes for text strings to make LIKE '%foo%' faster, though this requires addons, and PostgreSQL 9.1 or higher.
I doubt distinct by itself will ever use an index. I tried in fact and could not get it to use one. You can sort of force an index to be used by using a recursive CTE to pull individual records out (what can be called a "sparse scan"). We do something like this when pulling individual years out of the accounting record. This requires writing special queries though and so isn't really the general case.
count(*) is never going to be able to use an index due to mvcc rules. You can get approximate results by looking in the appropriate system catalogs however.

how to reduce the database's pressure

I have a database(sql server 2005),now there are about 100000 records in the table called users, when I do query use linq to sql, it is very slower and slower.how can I do some operate to improve the speed?
Analyse your query and add some indexes to your table may help.
To get a more specific answer post more specific information (table stucture, indexes you have, the sql code L2S generates, ...)
You could (in order of preference)
Save your query as a stored procedure
Add indexes to your users
table, for what you are querying for/sorting for
Analyze your query
(if it is complicated), see if there's a less-resource-intensive way
of doing it. There are graphical query analyzers to help you.
As a last resort, not use LINQ, but instead ADO.NET Entity Framework, it's significantly faster. But you'll only see performance improvements for crazy stuff, and only if you've already done all of the above.
Use stored procedures and then use linq to sql to get the desired rows, this will give performance.
The best tools at your disposal for analyzing your database access and seeing what needs to be optimized are:
SQL Server Profiler
Graphical Execution Plans
The first one will allow you to see the exact queries being sent to your database from your application, which is especially useful if it turns out that your application is chattier than you think. The second one will allow you to take those queries and see exactly what the SQL server is doing with them.
In the graphical execution plan, look for steps which use a lot of CPU and paths which transfer a lot of records. Those are what you'll want to optimize. It's possible that you're doing a table scan somewhere, which is slow, or maybe joining on many more records than you need somewhere, which is slow, etc.

Performance on joins in linq

HI ,
I am going to rewrite a store procedure in LINQ.
What this sp is doing is joining 12 tables and get the data and insert it into another table.
it has 7 left outer joins and 4 inner joins.And returns one row of data.
Now question.
1)What is the best way to achieve this joins in linq.
2) do you think this affect performance (its only retrieving one row of data at a given point of time)
Please advice.
Thanks
SNA.
You might want to check this question for the multiple joins. I usually prefer lambda syntax, but YMMV.
As for performance: I doubt the query performance itself will be affected, but there may be some overhead in figuring out the execution plan, since it's such a complicated query. The biggest performance hit will likely be the extra database round trip you will need compared to the stored procedure. If I understand you correctly, your current SP does the SELECT AND INSERT all at once. Using LINQ to SQL or LINQ to Entities, you will need to fetch the data first before you can actually write them to the other table.
So, it depends on your usage if rewriting is warranted. Alternatively, you can add stored procedures to your data model. It will be exposed as a method on your data context.

Resources