Amazon Redshift getting slower with every update run - performance

I am just getting started with Amazon Redshift.
I have just loaded a big table: millions of rows and 171 fields. The data quality is poor; there are a lot of characters that must be removed.
I have prepared an UPDATE for every column; since Redshift stores data by column, I assumed updating one column at a time would be faster.
UPDATE MyTable SET Field1 = REPLACE(Field1, '~', '');
UPDATE MyTable SET Field2 = REPLACE(Field2, '~', '');
.
.
.
UPDATE MyTable SET FieldN = REPLACE(FieldN, '~', '');
The first 'update' took 1 min. The second one took 1 min and 40 sec...
Every time I run one of the updates, it takes longer than the previous one. I have run 19 of them and the last one took almost 25 minutes; the time consumed by each UPDATE keeps increasing.
Another thing: with the first update, CPU utilization was minimal, while the last update pushed it to 100%.
I have a 3-node cluster of dc1.large instances.
I have rebooted the cluster but the problem continues.
I need some guidance to find the cause of this problem.

When you update a column, Redshift actually marks the old rows as deleted and inserts new rows with the new values, so there is a lot of space that needs to be reclaimed. You need to VACUUM your table after the updates.
The docs also recommend that you run ANALYZE after each update to refresh the statistics for the query planner.
http://docs.aws.amazon.com/redshift/latest/dg/r_UPDATE.html
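A minimal sketch of that maintenance step, using the table name from the question:
VACUUM MyTable;   -- reclaim the space left behind by the updates
ANALYZE MyTable;  -- refresh statistics for the query planner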

A more optimal way might be:
Create another identical table.
Read N (say 10,000) rows at a time from the first table, process them, and load them into the second table using S3 loading (COPY) instead of INSERT.
Drop the first table and rename the second table.
If you are running into space issues, delete the N migrated rows from the first table after every iteration and run VACUUM DELETE ONLY <name_of_first_table>.
References
s3 loading : http://docs.aws.amazon.com/redshift/latest/dg/tutorial-loading-run-copy.html
copy table from 's3://<your-bucket-name>/load/key_prefix' credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>' options;
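The create/drop/rename steps might look like the following sketch (MyTable_clean is a hypothetical name; the COPY above would load into it):
CREATE TABLE MyTable_clean (LIKE MyTable);   -- identical structure
-- ... load the cleaned rows into MyTable_clean with COPY ...
DROP TABLE MyTable;
ALTER TABLE MyTable_clean RENAME TO MyTable;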

Related

Speed up updates on Oracle DB with a lot of records

I have to update a table which has around 93 million records. At the beginning the DB updated 10k records per 5 seconds; now, after around 60 million updated records, updating the next 10k takes 30-60 seconds. I don't know why. I have to update the columns which are null.
I use a loop with a commit every 10k records:
LOOP
UPDATE TABLE
SET DATE_COLUMN = v_hist_date
WHERE DATE_COLUMN IS NULL
AND ROWNUM <= c_commit_limit
AND NOT_REMOVED IS NULL;
EXIT WHEN SQL%ROWCOUNT = 0;
COMMIT;
END LOOP;
Have you any ideas why it slow down so much and how is possible to speed up this update ?
Updates are queries too. You haven't posted an explain plan, but given you are filtering on columns which are null, it seems probable that your statement is executing a full table scan. That certainly fits the behaviour you describe.
What happens is this: on the first loop the full table scan finds 10,000 rows which fit the WHERE criteria almost immediately. Then you commit and the loop starts again. This time the scan reads the same blocks again, including the ones it updated in the previous iteration, before it finds the next 10,000 rows it can update. And so on. Each loop takes longer because the full table scan has to read more of the table on each pass.
This is one of the penalties of randomly committing inside a loop. It may be too late for you now, but a better approach would be to track an indexed column such as a primary key. Using such a tracking key will allow an index scan to skip past the rows you have already visited.
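A sketch of that approach, assuming the table has an indexed numeric primary key called ID and using made-up names (MY_TABLE, DATE_COLUMN, NOT_REMOVED) modeled on the question:
DECLARE
  v_hist_date   DATE := SYSDATE;                -- placeholder for your value
  v_last_id     NUMBER := 0;                    -- highest ID already processed
  v_max_id      NUMBER;
  c_batch_size  CONSTANT PLS_INTEGER := 10000;
BEGIN
  SELECT NVL(MAX(id), 0) INTO v_max_id FROM my_table;
  WHILE v_last_id < v_max_id LOOP
    -- each pass touches only a bounded key range via the index,
    -- instead of re-scanning the blocks already visited
    UPDATE my_table
       SET date_column = v_hist_date
     WHERE id > v_last_id
       AND id <= v_last_id + c_batch_size
       AND date_column IS NULL
       AND not_removed IS NULL;
    COMMIT;
    v_last_id := v_last_id + c_batch_size;
  END LOOP;
END;
/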

Insert into ElasticSearch using Hive/Qubole

I am trying to insert data into Elasticsearch from a Hive table.
CREATE EXTERNAL TABLE IF NOT EXISTS es_temp_table (
dt STRING,
text STRING
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource'='aggr_2014-10-01/metric','es.index.auto.create'='true')
;
INSERT OVERWRITE TABLE es_temp_table
SELECT dt, description
FROM other_table
However, the data is off. When I do a count(*) on my other table I get 6,000 rows, but when I search the aggr_2014-10-01 index I see 10,000 records! Somehow the records are being duplicated (rows are being copied over multiple times). Maybe I can remove the duplicate records inside Elasticsearch? I am not sure how I would do that, though.
I believe it might be a result of Hive/Qubole spawning two speculative tasks for every mapper. If one task succeeds, it tries to kill the other, but the other task has already done its damage (i.e., inserted into Elasticsearch). This is my best guess, but I would prefer to know the exact reason and whether there is a way for me to fix it.
set mapred.map.tasks.speculative.execution=false;
One thing I found was to set speculative execution to false, so that only one task is spawned per mapper (see above setting). However, now I am seeing undercounting. I believe this may be due to records being skipped, but I am unable to diagnose why those records would be skipped in the first place.
In this version, it also means that if even one task/mapper fails, the entire job fails, and then I need to delete the index (partial data was uploaded) and rerun the entire job (which takes ~4hours).
[PROGRESS UPDATE]
I attempted to solve this by putting all of the work in the reducer (it's the only way I found to ensure only a single task performs the insert and no duplicate records are written).
INSERT OVERWRITE TABLE es_temp_table
SELECT dt, description
FROM other_table
DISTRIBUTE BY cast(rand()*250 as int);
However, I now see a huge undercount: only 2,000 records. Elasticsearch does estimate some counts, but not to this extent; there are simply fewer records in Elasticsearch. This may be due to failed tasks (that are no longer retrying), or it may be from Qubole/Hive skipping over malformed entries. But I set:
set mapreduce.map.skip.maxrecords=1000;
Here are some other settings for my query:
set es.nodes=node-names
set es.port=9200;
set es.bulk.size.bytes=1000mb;
set es.http.timeout=20m;
set mapred.tasktracker.expiry.interval=3600000;
set mapred.task.timeout=3600000;
I determined the problem. As I suspected, insertion was skipping over some records that were considered "bad." I was never able to find exactly which records were being skipped, but I tried replacing all non-alphanumeric characters with a space. This solved the problem! The records are no longer being skipped, and all data is uploaded to Elasticsearch.
INSERT OVERWRITE TABLE es_temp_table
SELECT dt, REGEXP_REPLACE(description, '[^0-9a-zA-Z]+', ' ')
FROM other_table
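As a side note on the duplication problem above: if the source rows have a unique key, elasticsearch-hadoop can map it to the Elasticsearch document id via the es.mapping.id table property, so a row written twice overwrites itself instead of creating a duplicate. A sketch, assuming a unique id column exists (the column name here is made up):
CREATE EXTERNAL TABLE IF NOT EXISTS es_temp_table (
  id   STRING,
  dt   STRING,
  text STRING
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES(
  'es.resource'='aggr_2014-10-01/metric',
  'es.index.auto.create'='true',
  'es.mapping.id'='id'
);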

Best SQL DB design for temp storage of millions of records

I have a database table that collects records at the rate of about 4 records per/sec/device. This table gets pretty big pretty fast. When a device completes its task another process will loop through all the records, perform some operations, combine them into 5 minute chunks, compress them and store them for later use. Then it deletes all the records in that table for that device.
Right now there are nearly 1 million records for several devices. I can loop through them just fine to perform the processing, it appears, but when I try to delete them I time out. Is there a way to delete these records more quickly? Perhaps by turning off object tracking temporarily? Using some lock hint? Would the design be better to simply create a separate table for each device when it begins its task and then just drop it once processing of the data is complete? The timeout is set to 10 minutes. I would really like to get that process to complete within that 10 minute period if possible.
CREATE TABLE [dbo].[case_waveform_data] (
[case_id] INT NOT NULL,
[channel_index] INT NOT NULL,
[seconds_between_points] REAL NOT NULL,
[last_time_stamp] DATETIME NOT NULL,
[value_array] VARBINARY (8000) NULL,
[first_time_stamp] DATETIME NULL
);
CREATE CLUSTERED INDEX [ClusteredIndex-caseis-channelindex] ON [dbo].[case_waveform_data]
(
[case_id] ASC,
[channel_index] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
CREATE NONCLUSTERED INDEX [NonClusteredIndex-all-fields] ON [dbo].[case_waveform_data]
(
[case_id] ASC,
[channel_index] ASC,
[last_time_stamp] ASC
)
INCLUDE ( [seconds_between_points],
[value_array]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
SQL Server 2008+ standard is the DB platform
UPDATE 3/31/2014:
I have started going down a path that seems to be problematic. Is this really all that bad?
I am creating a stored proc that takes a table-value parameter containing the data I want to append and a varchar parameter that contains a unique table name for the device. This stored proc is going to check for the existence of the table and, if it does not exist, create it with a specific structure. Then it will insert the data from the TVP. The problem I see is that I have to use dynamic sql in the SP as there seems to be no way to pass in a table name as a variable to either a CREATE or INSERT. Plus, every article I read on how to do this says not to...
Unfortunately, with a single table receiving all the inserts at a frequency of 4/sec/device, just doing a count on the table for a specific case_id takes 17 minutes, even with a clustered index on case_id and channel_index. Trying to delete the rows takes around 25-30 minutes. This also causes locking, so the inserts start taking longer and longer and the service gets way behind. The slow counts occur even when no deleting is happening.
The described stored proc is designed to reduce the inserts from 4/sec/device to 1/sec/device as well as making it possible to just drop the table when done rather than deleting each record individually. Thoughts?
UPDATE 2 3/31/2014
I am not using cursors or any looping in the way you are thinking. Here is the code I use to loop through the records. This runs at an acceptable speed however:
using (SqlConnection liveconn = new SqlConnection(LiveORDataManager.ConnectionString))
{
using (SqlCommand command = liveconn.CreateCommand())
{
command.CommandText = channelQueryString;
command.Parameters.AddWithValue("channelIndex", channel);
command.CommandTimeout = 600;
liveconn.Open();
SqlDataReader reader = command.ExecuteReader();
// Call Read before accessing data.
while (reader.Read())
{
var item = new
{
//case_id = reader.GetInt32(0),
channel_index = reader.GetInt32(0),
last_time_stamp = reader.GetDateTime(1),
seconds_between_points = reader.GetFloat(2),
value_array = (byte[])reader.GetSqlBinary(3)
};
// Perform processing on item
}
}
}
The SQL I use to delete is trivial:
DELETE FROM case_waveform_data WHERE case_id = @CaseId
This line takes 25+ minutes to delete 1 million rows
Sample data (value_array is truncated):
case_id channel_index seconds_between_points last_time_stamp value_array first_time_stamp
7823 0 0.002 2014-03-31 15:00:40.660 0x1F8B0800000000000400636060 NULL
7823 0 0.002 2014-03-31 15:00:41.673 0x1F8B08000000000004006360646060F80F04201A04F8418C3F4082DBFBA2F29E5 NULL
7823 0 0.002 2014-03-31 15:00:42.690 0x1F8B08000000000004006360646060F80F04201A04F8418C3F4082DBFB NULL
When you delete a large amount of data from a table, SQL Server only marks those rows as deleted; a background process actually removes them from the pages later, as it gets idle time. Also, unlike TRUNCATE, DELETE is fully logged.
If you had Enterprise Edition, partitioning would be the approach other developers have suggested, but you are on Standard Edition.
Option 1: "Longer, tedious, no guarantee of 100% performance"
Let's say you keep the single-table approach. You can add a new column, "IsProcessed", to indicate which records are already processed. New data is inserted with the default value 0, so the other processes consuming this data now filter their queries on this column as well. After processing you will need an additional update on the table to mark those rows as IsProcessed = 1. You can then create a SQL Server Agent job to delete the TOP N rows where IsProcessed = 1 and schedule it as frequently as you can in a quiet time slot. "TOP N" because you have to find out by trial and error what the best number is for your environment; it may be 100, 1,000 or 10,000. In my experience a smaller number works best, and you increase the frequency of job execution instead. Let's say "DELETE TOP 1000 FROM Table" takes 2 minutes and you have a 4-hour clean window over night when this table is not being used; you can schedule the job to run every 5 minutes (3 minutes is just buffer), hence 12 executions per hour and 1,000 rows per execution, so in 4 hours you will delete 48k rows from the table. Over the weekend you have a larger window in which to catch up with the remaining rows.
You can see that this approach involves a lot of back and forth and a lot of minute details, and it is still not certain it will keep up with your needs in the future: if the input volume of data suddenly doubles, all your calculations fall apart. Another downside is that consumer queries of the data now have to rely on the IsProcessed column value. In your specific case the consumer always reads all data for a device, so indexing the table doesn't help you; instead it hurts insert performance.
I have personally used this solution, and it lasted for two years in one of our client environments.
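The scheduled cleanup step from Option 1 might look like the following sketch (the IsProcessed column and the batch size of 1,000 are the ones suggested above):
-- delete one small batch of already-processed rows; the Agent job
-- simply runs this statement repeatedly during the quiet window
DELETE TOP (1000)
FROM dbo.case_waveform_data
WHERE IsProcessed = 1;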
Option 2: "Quick, efficient, makes sense to me, may work for you"
Create one table per device and, as you mentioned, use a stored procedure to create the table on the fly if it does not exist. This matches my most recent experience, where we have a metadata-driven ETL and all the ETL target objects and APIs are created at run time based on user configuration. Yes, it is dynamic SQL, but used wisely and performance-tested once, it is not bad. The downside of this approach is debugging during the initial phase if something is not working, but in your case you know the table structure and it is fixed; you are not dealing with daily changes to the table structure. That is why I think it is more suitable for your situation. Another thing: you will now have to make sure tempdb is configured properly, because using a TVP and a temp table increases tempdb usage drastically, so the initial and increment space assigned to tempdb, and the disk on which tempdb is located, are the two main things to look at. As I said in Option 1, since the consumer processes always use ALL the data, I do not think you need any extra indexing in place; in fact I would test the performance without any index as well, since it is like processing all staging data.
Take a look at the sample code for this approach below. If you feel positive about it but have doubts about indexing or any other aspect, let us know.
Prepare the Schema objects
IF OBJECT_ID('pr_DeviceSpecificInsert','P') IS NOT NULL
DROP PROCEDURE pr_DeviceSpecificInsert
GO
IF EXISTS (
SELECT TOP 1 *
FROM sys.table_types
WHERE name = N'udt_DeviceSpecificData'
)
DROP TYPE dbo.udt_DeviceSpecificData
GO
CREATE TYPE dbo.udt_DeviceSpecificData
AS TABLE
(
testDeviceData sysname NULL
)
GO
CREATE PROCEDURE pr_DeviceSpecificInsert
(
@DeviceData dbo.udt_DeviceSpecificData READONLY
,@DeviceName NVARCHAR(200)
)
AS
BEGIN
SET NOCOUNT ON
BEGIN TRY
BEGIN TRAN
DECLARE @SQL NVARCHAR(MAX)=N''
,@ParaDef NVARCHAR(1000)=N''
,@TableName NVARCHAR(200)=ISNULL(@DeviceName,N'')
--get the UDT data into temp table
--because we can not use UDT/Table Variable in dynamic SQL
SELECT * INTO #Temp_DeviceData FROM @DeviceData
--Drop and Recreate the Table for Device.
BEGIN
SET @SQL ='
if object_id('''+@TableName+''',''u'') IS NOT NULL
drop table dbo.'+@TableName+'
CREATE TABLE dbo.'+@TableName+'
(
RowID INT IDENTITY NOT NULL
,testDeviceData sysname NULL
)
'
PRINT @SQL
EXECUTE sp_executesql @SQL
END
--Insert the UDT data in to actual table
SET @SQL ='
Insert INTO '+@TableName+N' (testDeviceData)
Select testDeviceData From #Temp_DeviceData
'
PRINT @SQL
EXECUTE sp_executesql @SQL
COMMIT TRAN
END TRY
BEGIN CATCH
ROLLBACK TRAN
SELECT ERROR_MESSAGE()
END CATCH
SET NOCOUNT OFF
END
Execute the sample code
DECLARE @DeviceData dbo.udt_DeviceSpecificData
INSERT @DeviceData (testDeviceData)
SELECT 'abc'
UNION ALL SELECT 'xyz'
EXECUTE dbo.pr_DeviceSpecificInsert
@DeviceData = @DeviceData, -- udt_DeviceSpecificData
@DeviceName = N'tbl2' -- nvarchar(200)

How to delete large data from Oracle 9i DB?

I have a table that is 5 GB. I was trying to delete from it like below:
delete from tablename
where to_char(screatetime,'yyyy-mm-dd') <'2009-06-01'
But it runs for a long time with no response. Meanwhile, I tried to check whether anything was blocking it with the query below:
select l1.sid, ' IS BLOCKING ', l2.sid
from v$lock l1, v$lock l2
where l1.block =1 and l2.request > 0
and l1.id1=l2.id1
and l1.id2=l2.id2
But I didn't find any blocking either.
How can I delete this large amount of data without problems?
5GB is not a useful measurement of table size. The total number of rows matters. The number of rows you are going to delete as a proportion of the total matters. The average length of the row matters.
If the proportion of the rows to be deleted is tiny it may be worth your while creating an index on screatetime which you will drop afterwards. This may mean your entire operation takes longer, but crucially, it will reduce the time it takes for you to delete the rows.
On the other hand, if you are deleting a large chunk of rows you might find it better to
Create a copy of the table using
create table t1_copy as select * from t1
where screatetime >= to_date('2009-06-01','yyyy-mm-dd')
Swap the tables using the rename command (see the sketch below).
Re-apply constraints and indexes to the new t1.
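A sketch of the swap, assuming the copy above has been built:
RENAME t1 TO t1_old;     -- keep the original around until you are sure
RENAME t1_copy TO t1;
-- re-create constraints, indexes, grants and triggers on the new t1,
-- then drop t1_old when it is no longer needed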
Another thing to bear in mind is that deletions eat more UNDO than other transactions, because they take more information to rollback. So if your records are long and/or numerous then your DBA may need to check the UNDO tablespace (or rollback segs if you're still using them).
Finally, have you done any investigation to see where the time is actually going? DELETE statements are just another query, and they can be tackled using the normal panoply of tuning tricks.
Use a query condition to export necessary rows
Truncate table
Import rows
If there is an index on screatetime your query may not be using it. Change your statement so that your where clause can use the index.
delete from tablename where screatetime < to_date('2009-06-01','yyyy-mm-dd')
It runs MUCH faster when you lock the table first. Also change the where clause, as suggested by Rene.
LOCK TABLE tablename IN EXCLUSIVE MODE;
DELETE FROM tablename
where screatetime < to_date('2009-06-01','yyyy-mm-dd');
EDIT: If the table cannot be locked, because it is constantly accessed, you can choose the salami tactic to delete those rows:
BEGIN
LOOP
DELETE FROM tablename
WHERE screatetime < to_date('2009-06-01','yyyy-mm-dd')
AND ROWNUM<=10000;
EXIT WHEN SQL%ROWCOUNT=0;
COMMIT;
END LOOP;
END;
Overall, this will be slower, but it won't burst your rollback segment and you can see the progress in another session (i.e. the number of rows in tablename goes down). And if you have to kill it for some reason, the rollback won't take forever and you haven't lost all the work done so far.

Inserts are 4x slower if table has lots of record (400K) vs. if it's empty

(Database: Oracle 10G R2)
It takes 1 minute to insert 100,000 records into a table. But if the table already contains some records (400K), then it takes 4 minutes and 12 seconds; also CPU-wait jumps up and “Free Buffer Waits” become really high (from dbconsole).
Do you know what's happening here? Is this because of frequent table extent allocation? The extent size for these tables is 1,048,576 bytes. I have a feeling the DB is trying to extend the table storage.
I am really confused about this. So any help would be great!
This is the insert statement:
begin
for i in 1 .. 100000 loop
insert into customer
(id, business_name, address1,
address2, city,
zip, state, country, fax,
phone, email
)
values (customer_seq.nextval, dbms_random.string ('A', 20), dbms_random.string ('A', 20),
dbms_random.string ('A', 20), dbms_random.string ('A', 20),
trunc (dbms_random.value (10000, 99999)), 'CA', 'US', '798-779-7987',
'798-779-7987', 'asdfasf@asfasf.com'
);
end loop;
end;
Here is the dstat output (CPU, IO, MEMORY, NET) for:
Empty Table inserts: http://pastebin.com/f40f50dbb
Table with 400K records: http://pastebin.com/f48d8ebc7
Output from v$buffer_pool_statistics
ID: 3
NAME: DEFAULT
BLOCK_SIZE: 8192
SET_MSIZE: 4446
CNUM_REPL: 4446
CNUM_WRITE: 0
CNUM_SET: 4446
BUF_GOT: 1407656
SUM_WRITE: 1244533
SUM_SCAN: 0
FREE_BUFFER_WAIT: 93314
WRITE_COMPLETE_WAIT: 832
BUFFER_BUSY_WAIT: 788
FREE_BUFFER_INSPECTED: 2141883
DIRTY_BUFFERS_INSPECTED: 1030570
DB_BLOCK_CHANGE: 44445969
DB_BLOCK_GETS: 44866836
CONSISTENT_GETS: 8195371
PHYSICAL_READS: 930646
PHYSICAL_WRITES: 1244533
UPDATE
I dropped the indexes on this table and performance improved drastically, even when inserting 100K rows into a table with 600K records (it took 47 seconds with no CPU wait - see dstat output http://pastebin.com/fbaccb10).
Not sure if this is the same in Oracle, but in SQL Server the first thing I'd check is how many indexes you have on the table. If there are a lot, the DB has to do a lot of work maintaining those indexes as records are inserted, and maintaining indexes over 500k rows is more work than over 100k.
The indices are some form of tree, which means the time to insert a record is going to be O(log n), where n is the size of the tree (≈ number of rows for the standard unique index).
The fastest way to insert them is going to be dropping/disabling the index during the insert and recreating it after, as you've already found.
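A sketch of that, using a made-up index name (the actual index DDL wasn't posted):
DROP INDEX customer_name_idx;
-- ... run the bulk insert ...
CREATE INDEX customer_name_idx ON customer (business_name);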
Even with indexes, 4 minutes to insert 100,000 records seems like a problem to me.
If this database has I/O problems, you haven't fixed them and they will appear again. I would recommend that you identify the root cause.
If you post the index DDL, I'll time it for a comparison.
I added indexes on id and business_name. Doing 10 iterations in a loop, the average time per 100,000 rows was 25 seconds. This was on my home PC/server all running on a single disk.
Another trick to improve performance is to turn on, or raise, the cache on your sequence (customer_seq). This allows Oracle to allocate a block of sequence values in memory instead of hitting the sequence object for each insert.
Be careful with this one, though: in some situations it will cause your sequence to have gaps between values.
More information here:
Oracle/PLSQL: Sequences (Autonumber)
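A sketch of raising the sequence cache (the cache size of 1,000 is just an example):
ALTER SEQUENCE customer_seq CACHE 1000;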
Sorted inserts always take longer the more entries there are in the table.
You don't say which columns are indexed. If you had indexes on fax, phone or email, you would have had a LOT of duplicates (ie every row).
Oracle 'pretends' to have non-unique indexes. In reality every index entry is unique with the rowid of the actual table row being the deciding factor. The rowid is made up of the file/block/record.
It is possible that, once you hit a certain number of records, the new ones were getting rowids that had to be fitted into the middle of existing index blocks, with a lot of index re-writing going on.
If you supply full table and index creation statements, others would be able to reproduce the experience which would have allowed for more evidence based responses.
I think it has to do with extending the internal structure of the data files, as well as maintaining the indexes for the added rows. I believe the database arranges the data in a non-linear fashion that helps speed up data retrieval on selects.
