Different execution plans for remote DELETE and INSERT / same JOINs - performance

I'm working in an environment where data lives on numerous client databases and is pulled, one client at a time, into a central data repository via SQL.
To automate a testing process, I've written a really nice, streamlined push-button script that backs up, purges and re-extracts data on a user-specified client database. It then restores the data from backup tables. It makes heavy use of synonyms to streamline the code.
I'm running into a performance problem with the purge process, where the DELETE query incurs a Remote Scan. It is exactly the same query as the INSERT/SELECT, which is simply passed as a Remote Query.
This INSERT works great:
INSERT INTO origChild
SELECT child.*
FROM
bakParent par
JOIN bakChild child ON par.GUID = child.GUID
WHERE
par.DateInserted = '2013-08-12 20:30:42.920'
This DELETE performs poorly:
DELETE
bakChild
FROM
bakParent par
JOIN bakChild child ON par.GUID = child.GUID
WHERE
par.DateInserted = '2013-08-12 20:30:42.920'
Below are the estimated query execution plans. The Remote Scan pulls 5M+ records while the INSERT/SELECT only deals with ~16,000 records.
I can't figure out why the plans are so different. I understand that queries to linked servers can lead to performance issues, but the two JOINs are identical, so I would expect the plans to be the same. (Or there should be a way for me to get the DELETE to perform similarly to the INSERT.)
I have confirmed that removing the INSERT from the first query leaves the SELECT with the same execution plan.
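One fallback I'm considering (an untested sketch - [RemoteServer] and the dbo.* names below are placeholders for whatever the linked server and synonyms actually resolve to) is to push the whole DELETE to the remote side so the join runs there:

-- [RemoteServer] and the dbo.* names are placeholders for the actual linked server and remote objects
EXEC ('
    DELETE child
    FROM dbo.bakChild child
    JOIN dbo.bakParent par ON par.GUID = child.GUID
    WHERE par.DateInserted = ''2013-08-12 20:30:42.920'';
') AT [RemoteServer];

This relies on RPC Out being enabled for the linked server.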
Any suggestions?

Related

What resources are used on a local Oracle database when a query is executed against a linked database?

I am writing an application in C# that will execute queries against local and linked databases (all Oracle 10g and newer), and I want to make sure I understand who is doing what when a linked database is being queried.
For example, for a simple query such as
SELECT * FROM FOO#DB_LINK
What is the local database server responsible for? I assume that this will use the CPU, disk, and memory on the database server that hosts DB_LINK, but what impact does this query have on the local database server resources?
What if the query is a little more complex, such as
SELECT * FROM FOO#DB_LINK F INNER JOIN BAR#DB_LINK B ON F.FOOBAR = B.FOOBAR
Is the entire query executed on the server that hosts DB_LINK, or is the INNER JOIN performed on the local server? If the INNER JOIN is performed by the local database, is it able to utilize the indexes that are on the linked tables (I wouldn't think so)? Is there a way to tell Oracle to execute the entire query on the linked database?
In my application, my queries will always be completely against either the local database, or a selected linked database. In other words, I will never have a query where I am getting data from both the local and a linked database at the same time like
SELECT * FROM FOO F INNER JOIN BAR#DB_LINK B ON F.FOOBAR = B.FOOBAR
To summarize,
I am only dealing with Oracle 10g or newer databases.
What is the local database server responsible for when a query (however complex) is strictly against linked tables?
What are the ways (if any) to optimize or give Oracle hints about how to best execute these kinds of queries? (examples in C# would be great)
Like most things related to the optimizer, it depends.
If you generate a query plan for a particular query, the query plan will tell you what if anything the local database is doing and which operations are being done on the remote database. Most likely, if statistics on the objects are reasonably accurate and the query references only objects in a single remote database, the optimizer will be smart enough to push the entire query to the remote server to execute.
Alas, the optimizer is not always going to be smart enough to do the right thing. If that happens, you can most likely resolve it by adding an appropriate driving_site hint to the query.
SELECT /*+ driving_site(F) */ *
FROM FOO#DB_LINK F
INNER JOIN BAR#DB_LINK B
ON F.FOOBAR = B.FOOBAR
Depending on how complex the queries are, how difficult it is to add hints to your queries, and how much difficulty you have in your environment getting the optimizer to behave, creating views in the remote database can be another way to force queries to run on the remote database. If you create a view on db_link that joins the two tables together and query that view over the database link, that will (in my experience) always force the execution to happen on the remote database where the view is defined. I wouldn't expect this option to be needed given the fact that you aren't mixing local and remote objects but I include it for completeness.
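A minimal sketch of that approach, assuming you can create objects on the remote database (FOO_BAR_V is an illustrative name; in practice you would list the columns you actually need, since only FOOBAR is named in the question):

-- On the remote database (FOO_BAR_V is an illustrative name):
CREATE OR REPLACE VIEW FOO_BAR_V AS
SELECT F.FOOBAR
FROM FOO F
INNER JOIN BAR B ON F.FOOBAR = B.FOOBAR;

-- From the local database; the join now runs where the view is defined:
SELECT * FROM FOO_BAR_V@DB_LINK;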
A 100% remote query will get optimized by the remote instance. The local instance will still need to allocate some memory and use CPU in order to fetch results from the remote server but the main work (things like hash joins and looping) will all be done by the remote instance.
When this happens, you will get a note in your local execution plan
Note
-----
- fully remote statement
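For reference, one way to see that note for the earlier all-remote example is a plain EXPLAIN PLAN plus DBMS_XPLAN:

EXPLAIN PLAN FOR
SELECT *
FROM FOO@DB_LINK F
INNER JOIN BAR@DB_LINK B ON F.FOOBAR = B.FOOBAR;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);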
As soon as something has to be done on the local server as part of the statement (e.g. an INSERT, or a join to a local table, including local dual), the query becomes distributed. Only one server can be considered the driving site, and it will typically be the local one (I can't come up with a demo where this chooses the remote site, even when it's cheaper, so maybe it's not cost based). Typically this will end up with you hitting some badness somewhere - perhaps a nested loop join against remote tables computed on the local side.
One thing to keep in mind with distributed queries - the optimizing instance will not look at histogram information from the other instance.

Why does Informatica fetch more records from the source when the source itself has fewer records?

I have an issue in our production environment: one of the workflows has been running for more than a day, inserting records into a SQL Server database. It's just a direct load mapping; there is no SQ override either. The Monitor shows the SQ count as 7 million and the same number of records being inserted into the target, but the source database shows only around 3 million records. How can this be possible?
Have you checked whether the Source Qualifier is joining more than one table? A screenshot of the affected mapping pipeline and an obfuscated log file would help.
Another thought... given that your job ran for a day, were there any jobs run in that time to purge old records from the source table?
Cases where I've seen this kind of thing happen:
There's a SQL query override doing something different than I thought (e.g. joining some tables)
I'm looking at a different source - verify the connections and make sure to check the same object in the same database on the same server that PowerCenter is connecting to.
It's a reusable session being executed multiple times by different workflows. In such a case, the Source/Target Statistics in the Workflow Monitor may refer to a different execution.

Apache Drill has bad performance against SQL Server

I tried using Apache Drill to run a simple join-aggregate query and the speed wasn't really good. My test query was:
SELECT p.Product_Category, SUM(f.sales)
FROM facts f
JOIN Product p on f.pkey = p.pkey
GROUP BY p.Product_Category
facts has about 422,000 rows and Product has 600 rows; the grouping comes back with 4 rows.
First I tested this query on SQL Server and got a result back in about 150 ms.
With Drill I first tried connecting directly to SQL Server and running the query, but that was slow (about 5 seconds).
Then I tried saving the tables into JSON files and reading from them, but that was even slower, so I tried Parquet files.
The first run came back in about 3 seconds, the next in about 900 ms, and then it stabilized at about 500 ms.
From reading around, this makes no sense and Drill should be faster!
I tried "REFRESH TABLE METADATA", but the speed didn't change.
I was running this on Windows, through the Drill command line.
Any idea if I need some extra configuration or something?
Thanks!
Drill is very fast, but it's designed for large distributed queries while joining across several different data sources... and you're not using it that way.
SQL Server is one of the fastest relational databases. Data is stored efficiently, cached in memory, and the query runs in a single process, so the scan and join are very quick. Apache Drill has much more work to do in comparison. It has to interpret your query into a distributed plan and send it to all the drillbit processes, which then look up the data sources, access the data using the connectors, run the query, and return the results to the first node for aggregation before you receive the final output.
Depending on the data source, Drill might have to read all the data and filter it separately which adds even more time. JSON files are slow because they are verbose text files that are parsed line by line. Parquet is much faster because it's a binary compressed column-oriented storage format designed for efficient scanning, especially when you're only accessing certain columns.
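For reference, one way to produce those Parquet files from inside Drill itself is a CTAS - a sketch, where dfs.tmp, the target table name, and the mssql plugin name are illustrative and depend on your storage-plugin configuration:

ALTER SESSION SET `store.format` = 'parquet';

CREATE TABLE dfs.tmp.facts_parquet AS
SELECT pkey, sales            -- keep only the columns the query needs
FROM mssql.dbo.facts;         -- 'mssql' stands in for however the SQL Server source is configured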
If you have a small dataset stored on a single machine then any relational database will be faster than Drill.
The fact that Drill gets you results in 500 ms with Parquet is actually impressive, considering how much more work it has to do to give you the flexibility it provides. If you only have a few million rows, stick with SQL Server. If you have billions of rows, then use the SQL Server columnstore feature to store data in columnar format with great compression and performance.
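A minimal sketch of that columnstore suggestion, assuming SQL Server 2014+ and that facts is an ordinary dbo table (the index name is illustrative):

-- cci_facts is an illustrative index name
CREATE CLUSTERED COLUMNSTORE INDEX cci_facts ON dbo.facts;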
Use Apache Drill when you:
Have 10s of billions of rows or more
Have data spread across many machines
Have unstructured data like JSON stored in files without a standard schema
Want to split the query across many machines so it runs faster in parallel
Want to access data from different databases and file systems
Want to join data across these different data sources
One thing people need to understand about how Drill works is how it translates a SQL query into an executable plan to fetch and process data from, theoretically, any source of data. I deliberately didn't say 'data source' so people won't think only of databases or other software-based data management systems.
Drill uses storage plugins to read records from whatever data the storage plugin supports.
After Drill gets these rows, it performs whatever is needed to execute the query: filtering, sorting, joining, projecting (selecting specific columns), etc.
So, by default, Drill doesn't use any of the source's own capabilities for processing the queried data. In fact, the source may not support any such capability at all!
If you wish to leverage any of the source's data processing features, you'll have to modify the storage plugin you're using to access this source.
One query I regularly remember when I think about Drill's performance is this one:
SELECT a.CUST_ID,
       (SELECT COUNT(*)
        FROM SALES.CUSTOMERS
        WHERE CUST_ID < a.CUST_ID) rowNum
FROM SALES.CUSTOMERS a
ORDER BY CUST_ID
Just because of the < comparison operator, Drill has to load the whole table (actually a Parquet file), sort it, and then perform the join.
This query took around 18 minutes to run on my machine, which is not a particularly powerful machine, but still, the effort Drill needs to process this query must not be ignored.
Drill's purpose is not to be fast; its purpose is to handle vast amounts of data and run SQL queries against structured and semi-structured data - and probably other things I can't think of at the moment, but you may find more information in the other answers.

Cache table in advance before executing Spark SQL

We are doing a POC to compare different tools, including Spark SQL, Apache Drill, and so forth. The benchmark dataset includes almost one thousand Parquet files. For the same query, Apache Drill takes several seconds while Spark SQL takes more than 40 minutes. I guess the running time for Spark SQL is dominated by reading files from disk. The POC actually aims at finding out how long the queries themselves take; we are not worried about the time spent reading files from disk at the moment. We are wondering whether there is a way to cache all the tables in advance and then execute the test queries against these cached tables. We understand that caching is also lazy and caches data right after the first action on a query. Our current solution is to use a dummy action and a dummy query to cache the table first and then execute the test queries.
For example:
table1 = sqlContext.read.load(path1)
table1.registerTempTable("table1")
sqlContext.cacheTable("table1")   # mark table1 as cached (still lazy at this point)
result1 = sqlContext.sql("sql1")

table2 = sqlContext.read.load(path2)
table2.registerTempTable("table2")
sqlContext.cacheTable("table2")   # mark table2 as cached (still lazy at this point)
result2 = sqlContext.sql("sql2")

result1.show()   # dummy action that materializes the cache for table1
result2.show()   # dummy action that materializes the cache for table2
I'm wondering what the cache really does. As we know, caching is lazy and happens at the first action. Does it make any difference which dummy actions and which dummy queries are used?
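For comparison, a hedged sketch of an alternative that skips the dummy query entirely, assuming Spark 1.2+ where the SQL CACHE TABLE statement is eager by default (issue these through sqlContext.sql after registering the temp tables):

-- Assuming Spark 1.2+: CACHE TABLE materializes the data immediately
-- (CACHE LAZY TABLE gives the old lazy behaviour)
CACHE TABLE table1;
CACHE TABLE table2;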

MERGE in Vertica

I would like to write a MERGE statement against a Vertica database. I know it can't be used directly, and an insert/update has to be combined to get the desired effect.
The merge statement looks like this:
MERGE INTO table c USING (
    SELECT b.field1, b.field2
    FROM table a, table b
    WHERE a.field3 = 'Y'
      AND a.field4 = b.field4
    GROUP BY b.field1
) t ON (c.field1 = t.field1)
WHEN MATCHED THEN
UPDATE
    SET c.UUS_NAIT = t.field2;
I would just like to see an example of MERGE being used as an insert/update.
You really don't want to do an update in Vertica. Inserting is fine. Selects are fine. But I would highly recommend staying away from anything that updates or deletes.
The system is optimized for reading large amounts of data and for inserting large amounts of data. Since a MERGE involves one of the two things to stay away from (an update), I would advise against it.
As you stated, you can break apart the statement into an insert and an update.
What I would recommend (not knowing the details of what you want to do, so this is subject to change):
1) Insert data from an outside source into a staging table.
2) Perform an INSERT-SELECT from that table into the table you desire, using the criteria you are thinking about - either with a join or in two statements with subqueries against the table you want to test against (a sketch follows below).
3) Truncate the staging table.
It seems convoluted, I guess, but you really don't want to do UPDATEs. And if you think that is a hassle, please remember that what causes the hassle is what gives you your gains on SELECT statements.
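A minimal sketch of steps 2 and 3, where dest_table, stg_table and the field names are illustrative:

-- 2) Insert only the rows that are not already in the destination:
INSERT INTO dest_table (field1, field2)
SELECT s.field1, s.field2
FROM stg_table s
LEFT JOIN dest_table d ON d.field1 = s.field1
WHERE d.field1 IS NULL;

-- 3) Clear the staging table for the next load:
TRUNCATE TABLE stg_table;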
If you want an example of a MERGE statement, follow the link; that is the link to the Vertica documentation. Remember to follow the instructions exactly. You cannot write a MERGE with WHEN NOT MATCHED followed by WHEN MATCHED; it has to follow the sequence given in the usage description in the documentation (which is the other way round). But you can choose to omit one clause completely.
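A minimal sketch of the required clause order (dest_table and stg_table are illustrative names):

MERGE INTO dest_table d
USING stg_table s
ON (d.field1 = s.field1)
WHEN MATCHED THEN UPDATE SET field2 = s.field2
WHEN NOT MATCHED THEN INSERT (field1, field2) VALUES (s.field1, s.field2);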
I'm not sure if you are aware that in Vertica, data which is updated or deleted is not really removed from the table, but just marked as 'deleted'. This data can be removed manually by running: SELECT PURGE_TABLE('schemaName.tableName');
You might need super user permissions to do that on that schema.
More about this can be read here: Vertica Documentation; Purge Data.
An example of this from Vertica's Website: Update and Insert Simultaneously using MERGE
I agree that MERGE is supported in Vertica version 6.0. But if Vertica's AHM or epoch management settings are set to save a lot of history (deleted) data, it will slow down your updates. Update speeds might go from bad, to worse, to horrible.
What I generally do to get rid of deleted (old) data is run the purge on the table after updating the table. This has helped maintain the speed of the updates.
Merge is useful where you definitely need to run updates. Especially incremental daily updates which might update millions of rows.
Getting to your answer: I don't think Vertica supports a subquery in MERGE. You would get the following:
ERROR 0: Subquery in MERGE is not supported
When I had a similar use-case, I created a view using the sub-query and merged into the destination table using the newly created view as my source table. That should let you keep using MERGE operations in Vertica and regular PURGEs should let you keep your updates fast.
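A sketch of that view-based workaround against the question's columns (merge_src_v, dest_table, table_a and table_b are illustrative stand-ins for the real names):

-- The subquery from the question becomes a view:
CREATE VIEW merge_src_v AS
SELECT b.field1, b.field2
FROM table_a a
JOIN table_b b ON a.field4 = b.field4
WHERE a.field3 = 'Y';

-- The MERGE then uses the view as its source:
MERGE INTO dest_table c
USING merge_src_v t
ON (c.field1 = t.field1)
WHEN MATCHED THEN UPDATE SET UUS_NAIT = t.field2;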
In fact, MERGE also helps avoid duplicate entries during inserts or updates if you use the correct combination of fields in the ON clause, which should ideally be a join on the primary keys.
I like geoff's answer in general. It seems counterintuitive, but you'll have better results creating a new table with the rows you want in it versus modifying an existing one.
That said, doing so would only be worth it once the table gets past a certain size, or past a certain number of UPDATEs. If you're talking about a table <1mil rows, I might chance it and do the updates in place, and then purge to get rid of tombstoned rows.
To be clear, Vertica is not well suited to single-row updates, but large bulk updates are much less of an issue. I would not recommend re-creating the entire table; I would look into strategies around recreating partitions or doing bulk updates from staging tables.