SQL query suddenly runs slowly until base table's stats are gathered - Oracle

We have a scheduled job that loads table A from multiple tables twice a day. During that procedure, there is one separate select statement that fetches two columns from table B. That is:
select column1, column2
into l_col1, l_col2
from Table B
where ....
The job starts and works through several selects in the procedure, but when it reaches the select above it halts; in a session window we can see that this select is the one holding the session. When we gather stats for Table B the problem goes away and the query runs fast, finishing in seconds.
However, the problem appears again when the job runs for the second time in the day. The same thing happens even though Table B has not changed; no transactions have been made against it. We gather statistics again to speed the query up. We tried adding a job that gathers stats for it before the main job starts, but it did not help. Does anybody know why this happens?
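For reference, the statistics gathering that makes the query fast again is essentially a call like the following (a sketch; the owner and table names are placeholders for our actual schema and Table B):
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname       => 'MY_SCHEMA',   -- placeholder: schema owning Table B
    tabname       => 'TABLE_B',     -- placeholder: Table B
    cascade       => TRUE,          -- also gather index stats
    no_invalidate => FALSE          -- invalidate dependent cursors so the statement is re-parsed
  );
END;
/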

Related

Multiple insert into a table using Apache Spark

I am working on a project and I am stuck on the following scenario.
I have a table: superMerge(id, name, salary)
and I have 2 other tables: table1 and table2
All the tables (table1, table2 and superMerge) have the same structure.
Now, my challenge is to insert/update superMerge table from table1 and table2.
table1 is updated every 10 mins and table2 every 20 mins, therefore at time t=20 mins I have 2 jobs trying to update the same table (superMerge in this case).
I want to understand how I can achieve this parallel insert/update/merge into the superMerge table using Spark or any other Hadoop application.
The problem here is that the two jobs can't communicate with each other, not knowing what the other is doing. A relatively easy solution would be to implement a basic file-based "locking" system:
Each job creates an (empty) file in a specific folder on HDFS indicating that the update/insert is in progress and removes that file when the job is done.
Now, each job has to check whether such a file exists or not prior to starting the update/insert. If it exists, the job must wait until the file is gone.
Can you control the code of job1 & job2? How do you schedule them?
In general you can convert those two jobs into one that runs every 10 minutes. Once every 20 minutes this unified job runs in a different mode (merging from 2 tables), while the default mode is to merge from 1 table only.
So when you have the same driver you don't need any synchronisation between the two jobs (e.g. locking). This solution assumes that the jobs finish in under 10 minutes.
How large are your datasets? Are you planning to do it in batch (Spark) or could you stream your inserts/updates (Spark Streaming)?
Let's assume you want to do it in batch:
Launch only one job every 10 minutes that can process the two tables: if you have table1 and table2, do a union and join the result with superMerge, as Igor Berman suggested.
Be careful: as your superMerge table gets bigger, the join will take longer.
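A rough sketch of that union-and-join approach in Spark SQL / HiveQL (column names follow the question; it assumes an id shows up in at most one of the two source tables per run, otherwise extra de-duplication is needed):
-- 1. Build the merged result into a staging table: the union of the two
--    incoming tables full-outer-joined against the current superMerge contents,
--    preferring the incoming values where an id exists on both sides.
DROP TABLE IF EXISTS superMerge_staging;
CREATE TABLE superMerge_staging AS
SELECT COALESCE(i.id, s.id)         AS id,
       COALESCE(i.name, s.name)     AS name,
       COALESCE(i.salary, s.salary) AS salary
FROM superMerge s
FULL OUTER JOIN (
    SELECT id, name, salary FROM table1
    UNION ALL
    SELECT id, name, salary FROM table2
) i
ON s.id = i.id;

-- 2. Replace superMerge with the staged result (done as a separate statement
--    because overwriting a table that is read in the same query is problematic).
INSERT OVERWRITE TABLE superMerge
SELECT id, name, salary FROM superMerge_staging;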
I faced this situation: write the table1 dataframe (DF1) to location1 and the table2 dataframe (DF2) to location2, and at the end just switch the paths over to the super merge table. You can also do a table-to-table insert, but that consumes a lot of runtime, especially in Hive.
Overwriting the staging locations location1 and location2:
df1.write.mode("overwrite").partitionBy("partition").parquet(location1)
df2.write.mode("overwrite").partitionBy("partition").parquet(location2)
Switching paths to the super merge table:
hiveContext.sql("ALTER TABLE super_merge_table ADD IF NOT EXISTS PARTITION (partition=x)")
hiveContext.sql("LOAD DATA INPATH 'location1/partition=x/' INTO TABLE super_merge_table PARTITION (partition=x)")
hiveContext.sql("ALTER TABLE super_merge_table ADD IF NOT EXISTS PARTITION (partition=x)")
hiveContext.sql("LOAD DATA INPATH 'location2/partition=x/' INTO TABLE super_merge_table PARTITION (partition=x)")
This way you can do the parallel merging without one overriding the other.

Oracle: troubles using the parallel option in CREATE TABLE

I am not an Oracle expert; I just write queries and scripts through TOAD, but I don't know how Oracle works internally.
To do some data analysis I have created several tables in order to build a DB from them.
To build them I used the following command
CREATE TABLE SCHEMA.NAME
TABLESPACE SCHEMASPACE NOLOGGING PARALLEL AS
select [...]
with the parallel option active in order to make the server work on more processors.
Once the DB was created I tried to run some queries on it, but the same query run twice on the same DB gave me two different answers. In detail, the first time I ran the query it returned a table with about 2500 rows, and the second time it returned a table with about 2650 rows.
I have tried to redo the entire process without the parallel option and now it seems to work, but it takes too much time and I need to run the process several times. Does anyone have a solution?

Hive Locks entire database when running select on one table

Hive 0.13 will take a SHARED lock on the entire database (I see a node like LOCK-0000000000 as a child of the database node in ZooKeeper) when running a select statement on any table in the database. Hive creates a shared lock on the entire schema even when running a select statement; this results in a freeze on CREATE/DELETE statements on other tables in the database until the original query finishes and the lock is released.
Does anybody know a way around this? The following link suggests turning concurrency off, but we can't do that, as we are replacing the entire table and have to make sure that no select statement is accessing the table before we replace its contents.
http://mail-archives.apache.org/mod_mbox/hive-user/201408.mbox/%3C0eba01cfc035$3501e4f0$9f05aed0$#com%3E
use mydatabase;
select count(*) from large_table limit 1; -- this table is very large and hive.support.concurrency=true
In another Hive shell, while the 1st query is executing:
use mydatabase;
create table sometable (id string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE ;
The problem is that the "create table" does not execute until the first query (select) has finished.
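While the first query runs, the blocking lock can be inspected from a Hive shell with SHOW LOCKS (standard HiveQL; the table name just mirrors the example above):
SHOW LOCKS;                       -- all current locks
SHOW LOCKS large_table EXTENDED;  -- locks held on the table being read, with details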
Update:
We are using Cloudera's distribution of Hive CDH-5.2.1-1 and we are seeing this issue.
I don't think Hive 0.13 is supposed to behave like that. Please check your Resource Manager and make sure you have enough memory when you are executing multiple Hive queries.
As you know, each Hive query will trigger a MapReduce job, and if YARN doesn't have enough resources it will wait until the previously running job completes. Please approach your issue from a memory point of view.
All the best !!

Lock Table taking more time to execute update statement - Oracle

We have a batch process which reads the base tables, performs some aggregation and then updates the tables with a modified flag.
We have an update statement which updates around 3 million rows. As part of the business requirement we need to have a table-level lock on the table we are updating.
UPDATE TABLE1 t1 SET PARAMETER1 = (SELECT p1 FROM TABLE2 t2 WHERE t1.ROW_ID = ROWIDTOCHAR(t2.ROW_ID));
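For context, the table-level lock is acquired before the update with a statement along these lines (the exact lock mode here is an assumption; it is not shown above):
LOCK TABLE TABLE1 IN EXCLUSIVE MODE;  -- held until the transaction commits or rolls back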
The observation we made today is that the update statement with the table-level lock takes 35 mins, while without the table-level lock it takes 20 mins.
I am not able to explain this observation. Please help!
Cheers,
Dwarak
Nobody but your database could tell you the reason for your observation. You'll have to do an AWR report.
However, it's not very plausible that the UPDATE would run longer just because the table had been locked beforehand.
Did you account for caching (both in the database and the filesystem) in your testing? Depending on what you did when, one statement might have run faster due to data already being in memory.
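If you go the AWR route, the usual pattern is to take a snapshot just before and just after each run and report on that interval (a sketch; it assumes you are licensed for the Diagnostics Pack):
EXEC DBMS_WORKLOAD_REPOSITORY.CREATE_SNAPSHOT;
-- ... run the UPDATE (with or without the table-level lock) ...
EXEC DBMS_WORKLOAD_REPOSITORY.CREATE_SNAPSHOT;

-- then generate the report interactively from SQL*Plus:
@?/rdbms/admin/awrrpt.sql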

Oracle and creating history

I am working on a system to track a project's history. There are 3 main tables: projects, tasks, and clients, plus a history table for each of them. I have the following trigger on the projects table.
CREATE OR REPLACE TRIGGER mySchema.trg_projectHistory
BEFORE UPDATE OR DELETE
ON mySchema.projects REFERENCING NEW AS New OLD AS Old
FOR EACH ROW
declare tmpVersion number;
BEGIN
select myPackage.GETPROJECTVERSION( :OLD.project_ID ) into tmpVersion from dual;
INSERT INTO mySchema.projectHistory
( project_ID, ..., version )
VALUES
( :OLD.project_ID,
...
tmpVersion
);
EXCEPTION
WHEN OTHERS THEN
-- Consider logging the error and then re-raise
RAISE;
END ;
/
I have a trigger like this for each of my three tables (projects, tasks, clients).
Here is the challenge: not everything changes at the same time. For example, somebody could just update a certain task's cost. In this case, only one trigger fires and I get one insert. I'd like to insert one record into each of the 3 history tables at once, even if nothing changed in the projects and clients tables.
Also, what if somebody changes a project's end_date, the cost, and, say, picks another client? Now I have three triggers firing at the same time. Only in this case will I get one record inserted into each of my three history tables (which is what I want).
If I modify the triggers to insert into all 3 tables for the first example, then I will get 9 inserts when the second example happens.
Not quite sure how to tackle this. Any help?
To me it sounds as if you want a transaction-level snapshot of the three tables created whenever you make a change to any of those tables.
Have a row level trigger on each of the three tables that calls a single packaged procedure with the project id and optionally client / task id.
The packaged procedure inserts into all three history tables the relevant project, client and task rows where there isn't already a history record for that key and transaction (i.e. you don't want duplicates). You have a couple of choices when it comes to the latter: you can use a unique constraint and either a BULK select and insert with FORALL/SAVE EXCEPTIONS, DML error logging (LOG ERRORS INTO), or an INSERT...SELECT...WHERE NOT EXISTS.
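For the last option, a minimal sketch of the duplicate-safe insert for the projects history (the procedure name and parameters are made up for illustration; the real packaged version would cover all three tables and the full column lists):
CREATE OR REPLACE PROCEDURE add_project_history (
  p_project_id IN mySchema.projects.project_id%TYPE,
  p_version    IN NUMBER
) AS
BEGIN
  -- Insert a history row for this project and version only if one
  -- is not already there (avoids duplicates within a transaction).
  INSERT INTO mySchema.projectHistory (project_id, version)   -- other columns elided
  SELECT p.project_id, p_version
  FROM   mySchema.projects p
  WHERE  p.project_id = p_project_id
  AND    NOT EXISTS (SELECT 1
                     FROM   mySchema.projectHistory h
                     WHERE  h.project_id = p.project_id
                     AND    h.version    = p_version);
END add_project_history;
/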
You do need to keep track of your transactions. I'm guessing this is what you were doing with myPackage.GETPROJECTVERSION. The trick here is to only increment versions when you have a new transaction. If, when you get a new version number, you hold it in a package-level variable, you can easily tell whether your session has already got a version number or not.
If your session is going to run multiple transactions, you'll need to 'clear out' the session-level version number if it was part of a previous transaction. If you get DBMS_TRANSACTION.LOCAL_TRANSACTION_ID and store that at the package/session level as well, you can determine whether you are in a new transaction or still part of the same one.
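A minimal sketch of that bookkeeping (the package, sequence and variable names are made up for illustration):
CREATE SEQUENCE history_seq;  -- hypothetical sequence supplying version numbers

CREATE OR REPLACE PACKAGE history_pkg AS
  FUNCTION get_version RETURN NUMBER;
END history_pkg;
/

CREATE OR REPLACE PACKAGE BODY history_pkg AS
  g_txn_id  VARCHAR2(200);  -- transaction id we last allocated a version for
  g_version NUMBER;         -- the version allocated for that transaction

  FUNCTION get_version RETURN NUMBER IS
    l_txn_id VARCHAR2(200) := DBMS_TRANSACTION.LOCAL_TRANSACTION_ID;
  BEGIN
    -- A different local transaction id means a new transaction:
    -- allocate a fresh version and remember which transaction owns it.
    IF g_txn_id IS NULL OR g_txn_id <> l_txn_id THEN
      g_txn_id  := l_txn_id;
      g_version := history_seq.NEXTVAL;  -- 11g+; use SELECT INTO on older versions
    END IF;
    RETURN g_version;
  END get_version;
END history_pkg;
/
The row-level triggers would then call history_pkg.get_version and pass the result to the packaged insert procedure, so all rows touched in the same transaction share one version.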
From your description, it looks like you would be capturing the effective and end dates for each of the history rows once any of the original rows changes.
E.g. the project_hist table would have eff_date and exp_date, which hold the start and end dates for a given project version. The projects table would just have an effective date (as it holds the active project).
I don't see why you would want to insert rows into all three history tables when only one of the tables' values is updated. You can pretty much get the details you need (as of a given date) using your current logic (inserting the old row into the history table only for the table that has actually been updated).
Alternative answer.
Have a look at Total Recall / Flashback Archive
You can set the retention to 10 years, and use a simple AS OF TIMESTAMP to get the data as of any particular timestamp.
Not sure on performance though. It may be easier to have a daily or weekly retention and then a separate scheduled job that picks out the older versions using the VERSIONS BETWEEN syntax and stores them in your history table.
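A sketch of what that looks like (the archive and tablespace names are placeholders, and Flashback Data Archive licensing/edition requirements vary by Oracle version):
-- Create an archive that keeps 10 years of history and enable it for the table.
CREATE FLASHBACK ARCHIVE projects_fba
  TABLESPACE users            -- placeholder tablespace
  RETENTION 10 YEAR;

ALTER TABLE mySchema.projects FLASHBACK ARCHIVE projects_fba;

-- Data as of a point in time:
SELECT * FROM mySchema.projects
  AS OF TIMESTAMP TIMESTAMP '2024-01-01 00:00:00';

-- All versions of a row between two points in time:
SELECT * FROM mySchema.projects
  VERSIONS BETWEEN TIMESTAMP (SYSTIMESTAMP - INTERVAL '7' DAY) AND SYSTIMESTAMP
  WHERE project_id = 42;      -- placeholder project id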
