Sorry if this is a dumb question but do i need to reindex my table every time i insert rows, or does the new row get indexed when added?
From the manual
Once an index is created, no further intervention is required: the system will update the index when the table is modified
http://postgresguide.com/performance/indexes.html
I think when you insert rows, the index does get updated. It maintains the sort on the index table as you insert data. Hence there are performance issues or downtimes on a table, if you try adding large number of rows at once.
On top of the other answers: PostgreSQL is a top notch Relational Database. I'm not aware of any Relational Database system where indices are not updated automatically.
It seems to depend on the type of index. For example, according to https://www.postgresql.org/docs/9.5/brin-intro.html, for BRIN indexes:
When a new page is created that does not fall within the last summarized range, that range does not automatically acquire a summary tuple; those tuples remain unsummarized until a summarization run is invoked later, creating initial summaries. This process can be invoked manually using the brin_summarize_new_values(regclass) function, or automatically when VACUUM processes the table.
Although this seems to have changed in version 10.
Related
I have a table in Hbase named 'xyz' . When I do an update operation on this table , it updates a table even though it is same record .
How can I control second record to not be added.
Eg:
create 'ns:xyz',{NAME=>'cf1',VERSIONS => 5}
put 'ns:xyz','1','cf1:name','NewYork'
put 'ns:xyz','1','cf1:name','NewYork'
Above put statements are giving 2 records with different timestamp if I check all versions. I am expecting that it should not add 2nd record because it have same value
HBase isn't going to look through the entire row and work out if it's the same as the data you're adding. That would be an expensive operation, and HBase prides itself on its fast insert speeds.
If you're really eager to do this (and I'd ask if you really want to do this), you should perform a GET first to see if the data is already present in the table.
You could also write a Coprocessor to do this every time you PUT data, but again the performance would be undesirable.
As mentioned by #Ben Watson, HBase is best known for it's performance in write since it doesn't need to check for the existence of a value as multiple versions will be maintained by default.
One hack what you can do is, you can use custom versioning. As show in the below screenshot, you have two versions already for a row key. Now if you are going to insert the same record with the same timestamp. HBase would be overwriting the same record with just the value.
NOTE: It is left to your application to get the same timestamp for a particular value.
I have a table(in oracle) size about 860 million records (850gb) on top we are getting about 2 -3 million records as source (flatfile).
we are doing a lookup on target if record already exist it will update if it is a new record it will insert(scd1).
The transformations we using are unconnectedlookup, sorter, filter and router, update strategy transformations, it was fine all this time, but as the table is huge and growing huge, it is taking for ever to insert and update, last night it took 19 hrs to 2.4 million records (2.1 millions were new so inserted and the rest are updates).
Today I got about 1.9 millions to go through i am not sure how long it will take any suggestions or help how can we handle this ?
1) Use just a connected lookup to oracle table, after SQ matching on primary key and filter out nulls (records missing in Oracle table) or not null (updates). Dont check for other columns for update. Skip sorter and filter. Just use update strategy.
2) Or use joiner and make flat file pipeline as master. Then check for nulls to find insert or updates.
3) Check if your target table dont have any trigger etc on it. If yes then check its logic and implement it in ETL.
Since you are dealing with 850mil data, you have two major bottlenecks - target lookup and writing into target.
You can think of this strategy -
Mapping 1 - Create a new mapping to load flat file data into a temp table TMP1.
Mapping 2 - Modify existing mapping. Just modify lookup query and join TMP1 and target (860mil)table in SQL Override. This will reduce time, I/O, lookup cache.
Also, please make sure you have an index on key columns in target. And you drop-create all other index while loading. Skipping sorter will help but adding joiner will not help much.
Regards,
Koushik
How many inserts vs updates do you have?
With just a few updates, try using Update else Insert target
property.
If there are many updates and few inserts, perform update
just if a key is found, without checking if anything has changed
If there are many source rows matching what you already have (i.e. an update that doesn't change anything) try to eliminate them. But don't compare all columns - use a hash instead. Just create an additional computed column that will contain a MD5 calculated on all columns. Then all you need to do is compare one column instead of all to detect a change.
1) Try using a merge statement if source and targets are in same database.
2) We can also use sql loader connection to improve the performance.
Clearly the bottleneck is in the target lookup and target load (update to be specific).
Try the following to tune the existing code:
1) Try to remove any unwanted lookup ports if you have in the lookup transformation. Keep only the fields that are used in the lookup condition as you are using it just to check if the record exists.
2) Try adding an index to the target table for the fields you are using for the update
3) Increase the commit interval of the session to a higher value.
4) Partial Pushdown optimization:
You can pushdown some of the processing to database which might be faster instead of doing it in Informatica
Create a staging table to hold the incoming data for that load.
Create a mapping to load the incoming file to the staging table. Truncate it before the start of the load to clear the records of the previous run.
In the SQL override of the existing mapping do a left join between the staging table and target table to find insert/updates. This will be faster than the Informatica lookup and eliminates the time taken to build the Informatica lookup cache.
5) Using MD5 to eliminate unwanted updates
For using MD5 you need to add a new field in the target table and do a mapping to update the existing records one time.
Then in your existing mapping add a step to compute MD5 for the incoming column.
If the record is identified for update then check if the MD5 computed for the incoming column is same as that of the target column. If the checksum also matches then don't update the record. Only if the check sum is different update the record. By this way you will filter out the unwanted updates. If there is no lookup match then insert the record.
Advantages: You are reducing the unwanted updates.
Disadvantages: You have to do an one time process to populate MD5 values for the existing records in the table.
If none of this works check with your database administrator to see if there is any issue in the database side that might slow down the load.
We are having a database design where we have table on which we have 1 Day Interval Partitioning on the column named as "5mintime" and on the same column we have created index also.
"5mintime" column can have data such as 1-Mar-2011,2-Mar-2011, in short there is no time component in it and from the UI also the user can select only one day period as minimum date.
My question is that while firing the select queries is there any advantage gained because of indexes since the partition is already there, on the flip side if i remove the indexes the insertion will be come faster, so any help on this would be greatly appreciated.
If I understand you right, then I think there's no need for the index:
A local index is indexed for every partition, which in your case has the same value in all rows (ie: 1-Mar-2011 in the 1-Mar-2011 partition, 2-Mar-2011 in the 2-Mar-2011 partiotion and so on).
A global Index will actually index the whole table but will find a whole partiotion, which is also not usefull since you already have partiones ...
But, why not check it?
If each day's data goes into its own partition and you can never search within days, but only for entire days worth of data, then, no, I don't see this index adding any value.
You can confirm whether or not SQL queries are using this index by enabling monitoring:
alter index myindex monitoring usage;
And then check to see if it's been used by querying v$object_usage for it some time later.
Am workin in oracle 11g. Does ALTER INDEX REBUILD ONLINE render indexes invalid when executed parallely?
A new index is built in parallel and while the rebuild is in progress the old index is still available. However, when the new index is swapped in for the old index there will be a period (hopefully short) when neither index is available.
From http://www.oracle-base.com/articles/9i/HighAvailabilityEnhancements9i.php#OnlineIndexRebuilds:
When the ONLINE keyword is used as
part of the CREATE or ALTER syntax
the current index is left intact while
a new copy of the index is built,
allowing DML to access the old index.
Any alterations to the old index are
recorded in a Index Organized Table
known as a "journal table". Once the
rebuild is complete the alterations
from the journal table are merged into
the new index. This may take several
passes depending on the frequency of
alterations to the index. The process
will skip any locked rows and commit
every 20 rows. Once the merge
operation is complete the data
dictionary is updated and the old
index is dropped. DML access is only
blocked during the data dictionary
updates, which complete very quickly.
I watched an online rebuild with toad and executed a query that used the specified index.
Don't see it becoming invalid.
Maybe for a very short time at the precise moment the 'new' index is activated.
Haven't found any docs about it yet though.
In Oracle are indexes used when an UPDATE is fired without a WHERE clause
By "used", do you mean "referred to" or "modified"?
An UPDATE without a WHERE clause boils down to an iteration over the entire table; I see no good reason why Oracle should refer to an index in this case, as there's no benefit to be had from that. (Although that's little more than a qualified guess.) nonnb is right that the index will be affected depending on what column you touch.
If you update affects indexed columns, then the index pages will need to be replaced as well.
Will Oracle use the index to find the rows being updated? With no where clause, almost certainly not.
Will Oracle have to read one or more indexes, getting blocks in consistent mode to update them? If you're updating any columns that are indexed, have function-based indexes which will result in an updated indexed value, or cause row movement among partitions, then yes, indexes "will be used."