Persisting data between benchmarks using BenchmarkDotNet - benchmarkdotnet

I'm trying to benchmark two databases (different types, different locations).
My select benchmarks are working fine, but I'm having trouble with my inserts, updates and deletes.
I tried saving the key (GUID) I use for the insert in a class field of type Queue<string>, but by the time my update benchmark runs this field has been reset and is therefore empty; the same happens in my delete benchmark.
I don't want to call the delete statement after the insert statement in my insert benchmark or an insert statement in my delete benchmark because then the time results are off.
How should I handle this situation?
I thought of creating a list of GUIDs in [GlobalSetup], but whenever I change the number of iterations I would need to grow or shrink this list accordingly.
Any advice will be much appreciated.

I've fixed this myself by saving the keys to a text file in [GlobalCleanup] and reading that file back in [GlobalSetup].
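A minimal sketch of that approach, assuming the benchmarks share a single Queue<string> of keys and a local text file (class, method and file names here are illustrative, not from the original post):

using System;
using System.Collections.Generic;
using System.IO;
using BenchmarkDotNet.Attributes;

public class CrudBenchmarks
{
    private const string KeyFile = "keys.txt";          // hypothetical file name
    private Queue<string> _keys = new Queue<string>();

    [GlobalSetup]
    public void Setup()
    {
        // Reload the keys persisted by a previous benchmark run, if any.
        if (File.Exists(KeyFile))
            _keys = new Queue<string>(File.ReadAllLines(KeyFile));
    }

    [Benchmark]
    public void Insert()
    {
        var key = Guid.NewGuid().ToString();
        // ... run the INSERT statement using 'key' ...
        _keys.Enqueue(key);
    }

    [Benchmark]
    public void Delete()
    {
        if (_keys.Count > 0)
        {
            var key = _keys.Dequeue();
            // ... run the DELETE statement using 'key' ...
        }
    }

    [GlobalCleanup]
    public void Cleanup()
    {
        // Persist the remaining keys so the next run (e.g. the update or
        // delete benchmarks) can pick them up in GlobalSetup.
        File.WriteAllLines(KeyFile, _keys);
    }
}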

Related

Data cleanup in Oracle DB is taking long time for 300 billion records

Problem statement:
There is an address table in Oracle that has relationships with multiple tables such as subscriber, member, etc.
The current design is such that any change in an associated table increments the record version throughout all tables.
As a result, a new record is added to the address table even if the same address is already present, resulting in a large number of duplicate copies.
We need to identify and remove the duplicate records and update the foreign keys in the associated tables, while making sure this doesn't impact the running application.
Tried solution:
We have written a script for the cleanup logic in which a unique hash is generated for every address. If the calculated hash is already present, the address is a duplicate, so we merge it into a single address record and update the foreign keys in the associated tables.
The problem is that there are around 300 billion records in the address table, so this cleanup process is taking a lot of time and will take several days to complete.
We have tried adding an index on the hash column, but the process is still slow.
We have also updated the insertion/query logic to use addresses according to the new structure (using the hash and without the version) in order to handle incoming requests in production.
We are planning to process the data in chunks, but that will be a very long, ongoing activity.
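For reference, a hedged sketch of the hash-based duplicate detection described above (the column names and the choice of STANDARD_HASH are assumptions, not from the original post):

SELECT STANDARD_HASH(street || '|' || city || '|' || zip, 'SHA1') AS addr_hash,
       MIN(address_id) AS keep_id,      -- the record to merge the others into
       COUNT(*)        AS copies
FROM   address
GROUP  BY STANDARD_HASH(street || '|' || city || '|' || zip, 'SHA1')
HAVING COUNT(*) > 1;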
Questions:
We would like to know whether any further improvements can be made to the above approach.
Would distributed processing help here (maybe using Hadoop Spark/Hive/MR, etc.)?
Is there some sort of tool that can be used here?
Suggestion 1
Use the built-in parallel delete:
delete /*+ parallel(t 8) */ mytable t where ...
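A hedged note, not from the original answer: parallel DML is disabled by default in a session, so the hint usually only takes effect after enabling it first, for example:

ALTER SESSION ENABLE PARALLEL DML;
delete /*+ parallel(t 8) */ mytable t where ...;
commit;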
Suggestion 2
Use distributed processing (Hadoop Spark/Hive) - watch out for potential contention on indexes or table blocks. It is recommended to have each process work on a logically isolated subset, e.g.
process 1 - delete mytable t where id between 1000 and 1999
process 2 - delete mytable t where id between 2000 and 2999
...
Suggestion 3
If more than ~30% of the table needs to be deleted, the fastest way is usually to create an empty table, copy all the required rows into it, drop the original table, rename the new one, and recreate all indexes and constraints. Of course this requires downtime, and it greatly depends on the number of indexes: the more you have, the longer it will take.
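A hedged sketch of that copy-and-swap approach, assuming a hash column as described in the question (all object names are illustrative):

CREATE TABLE address_new NOLOGGING PARALLEL 8 AS
SELECT /*+ parallel(a 8) */ *
FROM   address a
WHERE  a.address_id IN (SELECT MIN(address_id) FROM address GROUP BY addr_hash);

DROP TABLE address PURGE;
ALTER TABLE address_new RENAME TO address;
-- recreate indexes, constraints and grants afterwards, e.g.
CREATE INDEX ix_address_hash ON address (addr_hash) PARALLEL 8;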
P.S. There are no "magic" tools to do it. In the end they all run the same SQL commands that you can.
It's possible to use Oracle's MERGE statement to insert the data if you write clean SQL.

How to obtain count of update records from Hive

I am working with the Hive driver, where executeUpdate() does not return the affected record count. Is there any alternative way in which this can be obtained? We need the affected record count for further processing.
If I am not mistaken Hive does not show (or even know?!) the number of updated records. Therefore extracting this directly is likely not going to work.
Workaround
First run a count query using the exact same WHERE clause and log the result
Then do the actual update
Naturally this incurs significant overhead.
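A hedged sketch of that workaround in HiveQL (table and column names are illustrative, and UPDATE requires a transactional/ACID table):

-- 1) count the rows the update will touch, using the exact same predicate
SELECT COUNT(*) FROM orders WHERE status = 'PENDING' AND order_date < '2020-01-01';

-- 2) run the actual update with the identical WHERE clause
UPDATE orders SET status = 'EXPIRED'
WHERE status = 'PENDING' AND order_date < '2020-01-01';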

Best approaches to UPDATE the data in tables - Teradata

I am new to Teradata and fortunately got a chance to work on both DDL and DML statements.
One thing I observed is that Teradata is very slow when it comes to UPDATEs on a table with a large number of records.
The simplest way I found on Google to perform this update is to write an INSERT-SELECT statement with a CASE expression on the column whose values are to be replaced with new ones.
But what about when this situation arises in a data warehouse environment, where we need to update multiple columns in a table holding millions of rows?
Which would be the best approach to follow?
INSERT-SELECT only, OR MERGE-UPDATE, OR MLOAD?
I'm not sure whether any of the above approaches is even appropriate for this UPDATE operation.
Thank you in advance!
At the enterprise level we expect volumes to be huge, and updates are often part of scheduled jobs/scripts.
With huge volumes of data, an UPDATE is a costly operation that carries the risk of blocking the table for some time if the update fails (rollback via the transient journal). Although scripts are well tested and failures seldom happen in production environments, it is usually better to load the data that needs updating into a temporary table in the required form, delete the matching records from the target table, and insert the rows back, maintaining SCD-1 (where we don't keep history).
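A hedged sketch of that load-delete-reinsert pattern (table and column names are illustrative):

CREATE VOLATILE TABLE stg_customer AS (
    SELECT c.customer_id,
           n.new_segment AS segment,   -- the updated value
           c.created_dt
    FROM   customer c
    JOIN   new_segments n ON n.customer_id = c.customer_id
) WITH DATA ON COMMIT PRESERVE ROWS;

DELETE FROM customer
WHERE  customer_id IN (SELECT customer_id FROM stg_customer);

INSERT INTO customer
SELECT * FROM stg_customer;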

MERGE in Vertica

I would like to write a MERGE statement in Vertica database.
I know it can't be used directly, and insert/update has to be
combined to get the desired effect.
The MERGE statement looks like this:
MERGE INTO table_c c USING (
    SELECT b.field1, MAX(b.field2) AS field2
    FROM table_a a, table_b b
    WHERE a.field3 = 'Y'
      AND a.field4 = b.field4
    GROUP BY b.field1
) t
ON (c.field1 = t.field1)
WHEN MATCHED THEN
    UPDATE SET UUS_NAIT = t.field2;
Would just like to see an example of MERGE being used as insert/update.
You really don't want to do an update in Vertica. Inserting is fine. Selects are fine. But I would highly recommend staying away from anything that updates or deletes.
The system is optimized for reading large amounts of data and for inserting large amounts of data. So since you want to do an operation that does 1 of the 2 I would advise against it.
As you stated, you can break apart the statement into an insert and an update.
What I would recommend (not knowing the details of what you want to do, so this is subject to change):
1) Insert data from an outside source into a staging table.
2) Perform an INSERT-SELECT from that table into the table you desire using the criteria you are thinking about, either using a join or in two statements with subqueries against the table you want to test against.
3) Truncate the staging table.
It seems convoluted I guess, but you really don't want to do UPDATE's. And if you think that is a hassle, please remember that what causes the hassle is what gives you your gains on SELECT statements.
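A hedged sketch of steps 1) to 3) above (table, column and file names are illustrative):

-- 1) load new/changed rows into a staging table
COPY staging_tbl FROM LOCAL '/tmp/new_rows.csv' DELIMITER ',';

-- 2) insert only the rows you want into the target table
INSERT INTO target_tbl (field1, field2)
SELECT s.field1, s.field2
FROM staging_tbl s
LEFT JOIN target_tbl t ON t.field1 = s.field1
WHERE t.field1 IS NULL;   -- e.g. only rows not already present

-- 3) clear the staging table
TRUNCATE TABLE staging_tbl;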
If you want an example of a MERGE statement, follow the link; it points to the Vertica documentation. Remember to follow the instructions closely: you cannot write a MERGE with WHEN NOT MATCHED followed by WHEN MATCHED. It has to follow the sequence given in the usage description in the documentation (which is the other way round), but you can choose to omit one clause completely.
I'm not sure if you are aware of the fact that in Vertica, data which is updated or deleted is not really removed from the table, but just marked as 'deleted'. This sort of data can be removed manually by running: SELECT PURGE_TABLE('schemaName.tableName');
You might need super user permissions to do that on that schema.
More about this can be read here: Vertica Documentation; Purge Data.
An example of this from Vertica's Website: Update and Insert Simultaneously using MERGE
I agree that MERGE is supported in Vertica version 6.0. But if Vertica's AHM or epoch management settings are set to keep a lot of history (deleted) data, it will slow down your updates. The update speed can go from bad, to worse, to horrible.
What I generally do to get rid of deleted (old) data is run the purge on the table after updating the table. This has helped maintain the speed of the updates.
Merge is useful where you definitely need to run updates. Especially incremental daily updates which might update millions of rows.
Getting to your answer: I don't think Vertica supports subqueries in MERGE. You would get the following:
ERROR 0: Subquery in MERGE is not supported
When I had a similar use-case, I created a view using the sub-query and merged into the destination table using the newly created view as my source table. That should let you keep using MERGE operations in Vertica and regular PURGEs should let you keep your updates fast.
In fact merge also helps avoid duplicate entries during inserts or updates if you use the correct combination of fields in ON clause, which should ideally be a join on the primary keys.
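A hedged sketch of that view-based workaround, reusing the field names from the question (the view and table names are otherwise illustrative):

CREATE VIEW src_view AS
SELECT b.field1, MAX(b.field2) AS field2
FROM   table_a a
JOIN   table_b b ON a.field4 = b.field4
WHERE  a.field3 = 'Y'
GROUP  BY b.field1;

MERGE INTO table_c c USING src_view t ON (c.field1 = t.field1)
WHEN MATCHED THEN
    UPDATE SET UUS_NAIT = t.field2
WHEN NOT MATCHED THEN
    INSERT (field1, UUS_NAIT) VALUES (t.field1, t.field2);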
I like geoff's answer in general. It seems counterintuitive, but you'll have better results creating a new table with the rows you want in it versus modifying an existing one.
That said, doing so would only be worth it once the table gets past a certain size, or past a certain number of UPDATEs. If you're talking about a table <1mil rows, I might chance it and do the updates in place, and then purge to get rid of tombstoned rows.
To be clear, Vertica is not well suited for single row updates but large bulk updates are much less of an issue. I would not recommend re-creating the entire table, I would look into strategies around recreating partitions or bulk updates from staging tables.

Oracle PL/SQL: choosing the update/merge column dynamically

I have a table with data relating to several moments in time that I have to keep updated. To save space and time, however, each row in my table refers to a given day, and the hourly and quarter-hourly data for that day are scattered across the several columns of that same row. When updating the data for a particular moment in time I therefore must choose the column to be updated through some programming logic in my PL/SQL procedures and functions.
Is there a way to dynamically choose the column or columns involved in an update/merge operation without having to assemble the query string anew every time? Performance is a concern and the throughput must be high, so I can't do anything that would perform poorly.
Edit: I am aware of the normalization issues. However, I would still like to know a good way of choosing the columns to be updated/merged dynamically and programmatically.
The only way to dynamically choose what column or columns to use for a DML statement is to use dynamic SQL. And the only way to use dynamic SQL is to generate a SQL statement that can then be prepared and executed. Of course, you can assemble the string in a more or less efficient manner, you can potentially parse the statement once and execute it multiple times, etc. in order to minimize the expense of using dynamic SQL. But using dynamic SQL that performs close to what you'd get with static SQL requires quite a bit more work.
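A hedged sketch of that dynamic-SQL route with bind variables for the values (table, column and variable names are illustrative):

DECLARE
    v_col VARCHAR2(30);
    v_sql VARCHAR2(200);
BEGIN
    -- choose the hour/quarter column via your own logic, e.g. hour 14, quarter 3
    v_col := 'VAL_H14_Q3';
    -- the column name must be concatenated; the values can be bound
    v_sql := 'UPDATE day_measurements SET ' || v_col ||
             ' = :new_val WHERE day_key = :day_key';
    EXECUTE IMMEDIATE v_sql USING 42, DATE '2023-01-15';
END;
/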
I'd echo Ben's point-- it doesn't appear that you are saving time by structuring your table this way. You'll likely get much better performance by normalizing the table properly. I'm not sure what space you believe you are saving but I would tend to doubt that denormalizing your table structure is going to save you much if anything in terms of space.
One way to do what is required is to create a package with all possible updates (which aren't that many, as I'll only update one field at a given time) and then choose which query to use depending on my internal logic. This would, however, lead to a big if/else or switch/case-like statement. Is there a way to achieve similar results with better performance?
