I have to do a somewhat complicated data import. I need to run a number of UPDATEs, each of which currently updates over 3 million rows in one query. Each query takes about 30-45 seconds (some of them even 4-5 minutes). My question is whether I can speed this up. Where can I read about it: what kind of indexes, and on which columns, can I set to improve those updates? I don't need an exact answer, so I am not showing the tables. I am looking for material to learn from.
Two things:
1) Post an EXPLAIN ANALYZE of your UPDATE query.
2) If your UPDATE does not need to be atomic, consider breaking it up so that each statement affects fewer rows. To minimize the number of "lost rows" due to exceeding the Free Space Map, consider the following approach:
   1. BEGIN
   2. UPDATE ... with some predicate that limits the number of rows (e.g. WHERE username ILIKE 'a%'). Note that PostgreSQL's UPDATE has no LIMIT clause, so the limit has to come from the predicate.
   3. COMMIT
   4. VACUUM table_being_updated
   5. Repeat steps 1-4 until all rows are updated.
   6. ANALYZE table_being_updated
I suspect you're updating every row in your table and don't need all rows to be visible with the new value at the end of a single transaction, therefore the above approach of breaking the UPDATE up in to smaller transactions will be a good approach.
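A minimal sketch of the batching idea, using Python's built-in sqlite3 module as a stand-in for the real database. The `users` table, its columns, the `lower(email)` change and `batch_size` are all hypothetical; the point is only that each id range is updated and committed in its own transaction.

```python
import sqlite3

def update_in_batches(conn, batch_size):
    """Update all rows in id-range batches, committing after each batch."""
    max_id = conn.execute("SELECT COALESCE(MAX(id), 0) FROM users").fetchone()[0]
    batches = 0
    low = 0
    while low < max_id:
        high = low + batch_size
        with conn:  # one transaction per batch; commits on success
            conn.execute(
                "UPDATE users SET email = lower(email) WHERE id > ? AND id <= ?",
                (low, high),
            )
        # On PostgreSQL, the VACUUM step would go here, outside the transaction.
        batches += 1
        low = high
    return batches

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
with conn:
    conn.executemany(
        "INSERT INTO users (id, email) VALUES (?, ?)",
        [(i, f"USER{i}@EXAMPLE.COM") for i in range(1, 1001)],
    )

batches_run = update_in_batches(conn, batch_size=250)
remaining = conn.execute(
    "SELECT COUNT(*) FROM users WHERE email <> lower(email)"
).fetchone()[0]
```

Batching on the primary key keeps every batch an index range scan; a predicate on an unindexed column would rescan the table for each batch.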
And yes, an INDEX on the relevant columns specified in the UPDATE's predicate will help dramatically. Again, post an EXPLAIN ANALYZE if you need further assistance.
If by a number of UPDATEs you mean one UPDATE command for each updated row, then the problem is that all of the target table's indexes are updated and all constraints are checked for each updated row. If that is the case, try instead to update all rows with a single UPDATE:
update t
set a = t2.b
from t2
where t.id = t2.id
If the imported data is in a text file then insert it in a temp table first and update from there. See my answer here
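As a rough, runnable sketch of the "load into a temp table, then update once" idea, here is the same pattern in Python with sqlite3. The CSV contents and the column values are made up for illustration, and the UPDATE uses a correlated subquery, which is a portable equivalent of the UPDATE ... FROM form shown above.

```python
import csv
import io
import sqlite3

# Hypothetical import file with columns: id, b
import_file = io.StringIO("id,b\n1,alpha\n3,gamma\n")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, a TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)", [(1, "old1"), (2, "old2"), (3, "old3")]
)

# 1. Bulk-load the file into a temporary staging table.
conn.execute("CREATE TEMP TABLE t2 (id INTEGER PRIMARY KEY, b TEXT)")
rows = [(int(r["id"]), r["b"]) for r in csv.DictReader(import_file)]
conn.executemany("INSERT INTO t2 VALUES (?, ?)", rows)

# 2. One set-based UPDATE instead of one UPDATE per imported row.
cur = conn.execute(
    "UPDATE t SET a = (SELECT b FROM t2 WHERE t2.id = t.id) "
    "WHERE id IN (SELECT id FROM t2)"
)
updated = cur.rowcount
conn.commit()

result = conn.execute("SELECT id, a FROM t ORDER BY id").fetchall()
```

Rows that have no match in the staging table (id 2 here) are left untouched because of the `WHERE id IN (...)` filter.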
I want to scrub (or encrypt) the email information in a few tables for rows that are older than a few years.
I am planning to do this as part of a job. The next time I run the job, how can I skip the rows that are already scrubbed or encrypted?
I am looking for an approach that performs well.
"I want to scrub(or encrypt) the email information from a few tables which are older than a few years"
I hope this means you have a date column on these tables which you can use to determine which ones need to be scrubbed. The most efficient way of tackling the job is to track that date in an operational table, recording the most recent date scrubbed.
For example you have ten years' worth of data, and you need to scrub records which are more than four years old. Now this would work:
update t23
set email = null
where date_created < add_months(sysdate, -48);
But it seems like you want to batch things up. So build a tracking table, which at its simplest would be
create table tracker (
last_date_scrubbed date);
Populate last_date_scrubbed with a really old date, say date '2010-01-01'.
Now you can write a query like this:
update t23
set email = null
where date_created
< (select last_date_scrubbed + interval '1' year from tracker);
That will clean all records older than 2011. Increment the date in the tracker table by one year. Run the query again to clean the records from 2011. Repeat until you reach your target state of cleanliness, at which point you can switch to running the query monthly, with an interval of one month, or whatever suits.
Obviously you should proceduralize this. A procedure is the best way to encapsulate the steps and make sure everything is kept in step. Also you can use the database scheduler to run the procedure.
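One way the procedure might look, sketched in Python with sqlite3 rather than PL/SQL so that it runs anywhere. The sample rows are invented, dates are stored as ISO strings, and the extra `email IS NOT NULL` condition is my own addition so that the row count of each run only reflects newly scrubbed rows.

```python
import sqlite3

def scrub_next_year(conn):
    """Scrub one more year of rows, then advance the tracker.

    Returns the number of rows scrubbed by this run.
    """
    with conn:  # both statements commit together, keeping data and tracker in step
        cur = conn.execute(
            "UPDATE t23 SET email = NULL "
            "WHERE email IS NOT NULL "
            "AND date_created < "
            "(SELECT date(last_date_scrubbed, '+1 year') FROM tracker)"
        )
        conn.execute(
            "UPDATE tracker "
            "SET last_date_scrubbed = date(last_date_scrubbed, '+1 year')"
        )
    return cur.rowcount

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t23 (id INTEGER PRIMARY KEY, email TEXT, date_created TEXT)"
)
conn.execute("CREATE TABLE tracker (last_date_scrubbed TEXT)")
conn.execute("INSERT INTO tracker VALUES ('2010-01-01')")
conn.executemany(
    "INSERT INTO t23 (email, date_created) VALUES (?, ?)",
    [
        ("a@x.test", "2010-06-15"),
        ("b@x.test", "2011-03-01"),
        ("c@x.test", "2012-09-30"),
    ],
)
conn.commit()

first_run = scrub_next_year(conn)   # cutoff is 2011-01-01
second_run = scrub_next_year(conn)  # cutoff is 2012-01-01
left = conn.execute(
    "SELECT COUNT(*) FROM t23 WHERE email IS NOT NULL"
).fetchone()[0]
```

The returned counts play the role of sql%rowcount: summing them over the runs answers "how many rows have we done so far".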
"there is one downside to this approach. I thought that you want to be free upon choosing which rows to be updated."
I don't see any requirement to track which individual rows have been scrubbed. After all, the end state is that every record older than a certain date has been scrubbed. When I have done jobs like this previously all anybody wanted to know was, "how many rows have we done so far and how many have we still got to do?" Which can be answered by tracking the sql%rowcount for each run.
For the best performance, you can add a flag column to your main table, something like IsEncrypted. Then, every time you run a query for the rows that are not yet encrypted, add a WHERE condition that IsEncrypted is false so that only those rows are touched. There are other ways, though.
EDIT
Another way is to create a logger table. Basically, this table records whatever extra information you want about a given ID from another table. Create a table called EncryptionLogger with at least two columns: EmailTableId and IsEncrypted. Then any query can simply select the rows WHERE the Id is NOT IN this table.
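A small sketch of the flag-column variant, again using Python's sqlite3 as a stand-in. The `emails` table, the snake_case column name `is_encrypted` and the index are assumptions for illustration; the job scrubs and flags rows in one statement, so a rerun skips everything already processed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE emails ("
    " id INTEGER PRIMARY KEY,"
    " email TEXT,"
    " is_encrypted INTEGER NOT NULL DEFAULT 0)"
)
# Index so the job can find pending rows without scanning the whole table.
conn.execute("CREATE INDEX emails_pending ON emails (is_encrypted)")
conn.executemany(
    "INSERT INTO emails (email) VALUES (?)", [("a@x.test",), ("b@x.test",)]
)
conn.commit()

def scrub_pending(conn):
    """Scrub only rows not yet flagged, and flag them in the same statement."""
    with conn:
        cur = conn.execute(
            "UPDATE emails SET email = NULL, is_encrypted = 1 "
            "WHERE is_encrypted = 0"
        )
    return cur.rowcount

first = scrub_pending(conn)   # processes the two initial rows
conn.execute("INSERT INTO emails (email) VALUES ('c@x.test')")
conn.commit()
second = scrub_pending(conn)  # only the newly inserted row
third = scrub_pending(conn)   # nothing left to do
```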
In Cassandra, a row can be very wide and store time-series data. For example, one row could look like the following:
RowKey: "weather"
name=2013-01-02:temperature, value=90,
name=2013-01-02:humidity, value=23,
name=2013-01-02:rain, value=false,
name=2013-01-03:temperature, value=91,
name=2013-01-03:humidity, value=24,
name=2013-01-03:rain, value=false,
name=2013-01-04:temperature, value=90,
name=2013-01-04:humidity, value=23,
name=2013-01-04:rain, value=false.
9 columns of 3 days' weather info.
Time is a primary key in this row, so the columns in the row are ordered by time.
My question is: is there any way for me to do a query like "what is the last/first day's humidity value in this row?" I know I could use an ORDER BY clause in CQL, but since this row is already sorted by time, there should be some way to get the first/last entry directly instead of doing another sort. Or does Cassandra already optimize ORDER BY under the hood?
Another way I could think of is, store another column in this row called "last_time_stamp" that always updates itself as new data is inserted in. But that would require one more update every time I insert new weather data.
Thanks for any suggestion!:)
Without seeing more of your actual table, I suggest using a timestamp (or timeuuid if there is a possibility for collisions) as the second component in a compound primary key. Using this, you can get the last "row" by selecting ORDER BY t DESC LIMIT 1.
You could also change the clustering order in your schema to order it naturally for "last N" queries.
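This is not Cassandra, but the same principle can be tried out locally with SQLite: a WITHOUT ROWID table is stored in primary-key order, roughly like a partition clustered by timestamp. The sketch below (table and column names are my own) reads the newest and oldest entries with ORDER BY ... LIMIT 1 and checks the query plan for a separate sort step.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Rows are stored ordered by (row_key, t), loosely analogous to a
# partition key plus a clustering column.
conn.execute(
    "CREATE TABLE weather ("
    " row_key TEXT, t TEXT, humidity INTEGER,"
    " PRIMARY KEY (row_key, t)) WITHOUT ROWID"
)
conn.executemany(
    "INSERT INTO weather VALUES ('weather', ?, ?)",
    [("2013-01-02", 23), ("2013-01-03", 24), ("2013-01-04", 23)],
)

query = (
    "SELECT t, humidity FROM weather "
    "WHERE row_key = 'weather' ORDER BY t {} LIMIT 1"
)
last = conn.execute(query.format("DESC")).fetchone()
first = conn.execute(query.format("ASC")).fetchone()

# The plan's detail column mentions a temp b-tree when a sort is needed.
plan = [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + query.format("DESC"))]
needs_sort = any("TEMP B-TREE" in step for step in plan)
```

Because the data is already stored in key order, the engine walks the key from one end and stops after one row; the ORDER BY only picks the direction.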
Please see examples and linked resource in this answer.
I have a PowerPivot table for which I need to be able to determine how long an item was in an Error state. My data set looks something like this:
What I need to be able to do is to look at the values in the ID and State columns, and see if the value in the previous row is ERROR in the State column, and the same in the ID column. If it is, I then need to calculate the difference between the Changed Date values in those two rows.
So, for example, when I got to row 4, I would see that the value in the State column for Row 3, the previous row, is ERROR, and that the value in the ID column in the previous row is the same as the current row, so I would then calculate the difference between the Changed Date values in Row 3 and Row 4 (I don't care about the values in any of the other columns for this particular requirement).
Is there a way to do this in PowerPivot? I've done a fair amount of Internet searching, and it looks like if it can be done, it would use the EARLIER or EARLIEST DAX functions, but I can't find anything that tells me how, or even if, this can be done.
Thanks.
Chris,
I have had similar requirements many times and after a really long time of trial-and-error, I finally understood how EARLIER works. It can be very powerful, but also very slow so always check for the performance of your calculations.
To answer your question, you will need to create 4 calculated columns:
1) Item Rank - used for ranking the issues with same Item ID
=COUNTROWS(FILTER('ID', EARLIER([Item ID]) = [Item ID] && EARLIER([Date]) >= [Date]))
2) Follows Error - to easily find the issue that follows an EROR issue
=IF([State] = "EROR",[Item Rank]+1)
3) Time of Following Issue - simple lookup so that you can calculate the difference
=IF([Follows Error]>0,
LOOKUPVALUE([Date], [User], [User], [Item Rank], [Follows Error]),
BLANK()
)
4) Time Diff - calculation of the time difference for the specific issue
=IF([State]="EROR",
DAY([Time of Following Issue])-DAY([Date]),
BLANK()
)
With those calculated columns, you can then easily create a PowerPivot table: drag State and Item Id onto the ROWS pane and then simply add Time Diff to Values. You will get an overview of the issues in the "EROR" state and the time it took to resolve them.
This is what it looks like in PowerPivot window:
And the resulting Pivot table:
You can download my Excel file here (2013).
As I mentioned, be careful with the performance as the calculated columns with nested EARLIER and IF conditions might be a bit too performance-demanding. If there is a smarter way, I would be very happy to see it, but for now this works for me just fine.
Also, keep in mind that all calculated columns could be nested into 1, but I kept them separated to make it easier to understand the formulas.
Hope this helps :-)
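For readers who want to check the logic outside PowerPivot, the same "previous row is in error and has the same ID" rule can be sketched in plain Python. The sample rows are invented, and the state is spelled "ERROR" as in the question. Unlike DAY(), which returns only the day of the month, subtracting whole dates also works across month boundaries.

```python
from datetime import date
from itertools import groupby
from operator import itemgetter

# Hypothetical rows: (item_id, state, changed_date)
rows = [
    (1, "OPEN", date(2014, 1, 1)),
    (1, "ERROR", date(2014, 1, 3)),
    (1, "CLOSED", date(2014, 2, 2)),
    (2, "ERROR", date(2014, 1, 5)),
    (2, "OPEN", date(2014, 1, 6)),
    (2, "CLOSED", date(2014, 1, 9)),
]

def days_in_error(rows):
    """Return {item_id: total days in ERROR}, pairing each ERROR row
    with the next row of the same item."""
    totals = {}
    ordered = sorted(rows, key=itemgetter(0, 2))  # by item, then by date
    for item_id, group in groupby(ordered, key=itemgetter(0)):
        group = list(group)
        for prev, cur in zip(group, group[1:]):
            if prev[1] == "ERROR":
                totals[item_id] = totals.get(item_id, 0) + (cur[2] - prev[2]).days
    return totals

result = days_in_error(rows)
```

Item 1 enters the error state on 3 January and leaves it on 2 February, which is 30 days; a day-of-month subtraction would have produced -1 for that pair.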
newparts_calc
if (([MonthToDateQuery].[G/L Account] = 4200 and [Query1].[G_L_Group] = 'NEW')) THEN ([Credit Amount]-[Debit Amount]) ELSE (0)
Data Item1
total([newparts_calc])
I need Data Item1 to return newparts_calc values only.
So for example in 1st row Data Item1 should be 8,540.8, but is 34,163.2
What's wrong? How do I fix it?
REVISED QUESTION
I apologize for not making sense on the original question.
I have many calcs like this that I'm trying to gather and put on a crosstab. I want to see sales by month (row) and part category (column).
[Query2] is the one shown in picture above.
It joins [MonthToDateQuery] AND [Query1]
The join is on 'Invoice' and the cardinality is 1..1 = 1..1
[MonthToDateQuery] is based on the package I'm working in, General Ledger. It supplies the G/L entries for each sales G/L account.
[Query1] is a SQL query I brought in to be able to break out categories even further by G/L group.
For example, G/L account 4300 is Rebuilt. However, I needed to break it out even further to see Rebuilt-Production and Rebuilt-New. I can do that with the G/L group.
I saw in my G/L account ledger entries that they reference the invoice number, so that's how I tied in my SQL.
As you can see from the table below (which is the tabular data view of the query), I need a total. I have tried plugging newparts_calc into my crosstab and setting aggregation to total, but the numbers still don't seem right. I don't think I have something set as it should be.
All the calcs I'm doing are based on single or multiple G/L Accounts and single or multiple G/L Groups.
Any advice?
As you can see, the problem seems to be duplicate invoice numbers.
How can I fix this?
Couple things come to mind:
-Set the processing order to 2
-Since your calc is always a multiple and you are joining two queries, you may need to check your cardinality. Sometimes it helps to add derived queries to ensure you are working with the correct grain.
I'm obviously missing something, but if you want
"I need Data Item1 to return newparts_calc values only."
just use newparts_calc, without total? That would give you the proper value for row 1 :-)
If you need a running-total for days (sum of values for previous days) — you should use a running_total function.
At a guess, one of your two queries is returning multiple rows for each invoice, which will cause this double counting. Look at the output of the two queries and see if that's happening. If so, then you just need to work out how to collapse that down to one row per invoice.
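The effect is easy to reproduce outside Cognos. In the sketch below (Python with sqlite3; the table names and the single invoice are hypothetical), the second query returns four rows for one invoice, which matches the figures in the question, since 34,163.2 is exactly 4 x 8,540.8. Collapsing that query to one row per invoice before joining brings the total back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gl_entries (invoice TEXT, account INTEGER, amount REAL)")
conn.execute("CREATE TABLE invoice_groups (invoice TEXT, gl_group TEXT)")
conn.execute("INSERT INTO gl_entries VALUES ('INV-1', 4200, 8540.8)")
# Four rows for the same invoice, e.g. one per invoice line.
conn.executemany(
    "INSERT INTO invoice_groups VALUES ('INV-1', ?)", [("NEW",)] * 4
)

# Joining directly repeats the ledger amount once per matching row.
inflated = conn.execute(
    "SELECT SUM(g.amount) FROM gl_entries g "
    "JOIN invoice_groups i ON i.invoice = g.invoice "
    "WHERE g.account = 4200 AND i.gl_group = 'NEW'"
).fetchone()[0]

# Collapsing to one row per invoice and group restores the intended grain.
correct = conn.execute(
    "SELECT SUM(g.amount) FROM gl_entries g "
    "JOIN (SELECT DISTINCT invoice, gl_group FROM invoice_groups) i "
    "ON i.invoice = g.invoice "
    "WHERE g.account = 4200 AND i.gl_group = 'NEW'"
).fetchone()[0]
```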
Per your new question: the underlying data has got to be causing the issue. It's clearly not 1:1 (note that even though this is your stated cardinality, Cognos does not enforce 1:1). Invoice number is not unique, and GL Group is at a lower level.
I'm using a table with a counter to ensure unique IDs on a child element.
I know it is usually better to use a sequence, but I can't use one because I have a lot of counters: a customer can create a couple of buckets, each of them needs its own counter, and each counter has to start at 1 (it's a requirement, my customer needs "human readable" keys).
I'm creating records (let's call them items) that have a prikey (bucket_id, num = counter).
I need to guarantee that the bucket_id / num combination is unique (so using a sequence as prikey won't fix my problem).
The creation of rows doesn't happen in PL/SQL, so I need to claim the number up front (by the way, gaps are not against the requirements).
My solution was:
UPDATE bucket
SET counter = counter + 1
WHERE id = param_id
RETURNING counter INTO num_forprikey;
PL/SQL returns num_forprikey so the item record can be created.
Question:
Will I always get unique num_forprikey even if the user concurrently asks for new items in a bucket?
"Will I always get unique num_forprikey even if the user concurrently asks for new items in a bucket?"
Yes, at least up to a point. The first user to issue that update gets a lock on the row. So no other user can successfully issue that same statement until user numero uno commits (or rolls back). So uniqueness is guaranteed.
Obviously, the cavil is regarding concurrency. Your access to the row is serialized, so there is no way for two users to get a new PRIKEY simultaneously. This is not necessarily a problem. It depends on how many users you have creating new Items, and how often they do it. One user peeling off numbers in the same session won't notice a thing.
I seem to recall this problem from many years back, working on, of all things, an INGRES database. There were no sequences in those days, so a lot of effort was put into finding the best scaling solution for this problem by the top INGRES minds of the day. I was fortunate enough to be working alongside them, so that even though my mind is pitifully smaller than any of theirs, proximity = residual effect and I retained something. This was one of the things. Let me see if I can remember.
1) for each counter you need row in a work table.
2) each time you need a number
a) lock the row
b) update it
c) get its new value (you use returning for this which I avoid like the plague)
d) commit the update to release your lock on the row
The reason for the commit is for trying to get some kind of scalability. There will always be a limit but you do not serialize on getting a number for any period of time.
In the Oracle world we would improve the situation by using a function defined as an AUTONOMOUS_TRANSACTION in order to acquire the next number. If you think about it, this solution requires that gaps be allowed, which you said is OK. By committing the number update independently of the main transaction, you gain scalability but you introduce gapping.
You will have to accept the fact that your scalability will drop dramatically in this scenario. This is due to at least two reasons:
1) the update/select/commit sequence does its best to reduce the time during which the KEY row is locked, but it is still not zero. Under heavy load, you will serialize and eventually be limited.
2) you are committing on every key get. A commit is an expensive operation requiring many memory and file management actions on the part of the database. This will limit you also.
In the end you are likely looking at three or more orders of magnitude drop in concurrent transaction load because you are not using sequences. I base this on my experience of the past.
But if your customer requires it, what can you do, right?
Good luck. I have not tested the code for syntax errors, I leave that to you.
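create or replace function get_next_key (key_name_p in varchar2) return number is
  pragma autonomous_transaction;  -- the commit below is independent of the caller's transaction
  key_v number;
begin
  -- the update locks the counter row until the commit below
  update key_table set key = key + 1 where key_name = key_name_p;
  -- read back the value just written (visible inside this transaction)
  select key into key_v from key_table where key_name = key_name_p;
  commit;
  return (key_v);
end;
/
show errors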
You can still use sequences, just use the row_number() analytic function to please your users. I described it here in more detail: http://rwijk.blogspot.com/2008/01/sequence-within-parent.html
Regards,
Rob.
I'd figure out how to make sequences work. It's the only guarantee, though an exception clause could be coded.
http://www.orafaq.com/forum/t/83382/0/ The benefit of sequences (and they could be dynamically created) is that you can specify nocache and guarantee order.