How to change the read/write ratio of sysbench oltp_read_write script - sysbench

I'm using sysbench (1.1.0) to test the performance of MySQL, and I want to test a scenario where the read:write ratio is 95%:5%. Are there any parameters to change the read/write ratio?

You can have a look at the oltp_read_write script:
function event()
   if not sysbench.opt.skip_trx then
      begin()
   end

   execute_point_selects()

   if sysbench.opt.range_selects then
      execute_simple_ranges()
      execute_sum_ranges()
      execute_order_ranges()
      execute_distinct_ranges()
   end

   execute_index_updates()
   execute_non_index_updates()
   execute_delete_inserts()

   if not sysbench.opt.skip_trx then
      commit()
   end

   check_reconnect()
end
This script defines the queries executed in each transaction, and the available options are below (from sysbench oltp_read_write.lua help):
oltp_read_write.lua options:
--auto_inc[=on|off] Use AUTO_INCREMENT column as Primary Key (for MySQL), or its alternatives in other DBMS. When disabled, use client-generated IDs [on]
--create_secondary[=on|off] Create a secondary index in addition to the PRIMARY KEY [on]
--create_table_options=STRING Extra CREATE TABLE options []
--delete_inserts=N Number of DELETE/INSERT combinations per transaction [1]
--distinct_ranges=N Number of SELECT DISTINCT queries per transaction [1]
--index_updates=N Number of UPDATE index queries per transaction [1]
--mysql_storage_engine=STRING Storage engine, if MySQL is used [innodb]
--non_index_updates=N Number of UPDATE non-index queries per transaction [1]
--order_ranges=N Number of SELECT ORDER BY queries per transaction [1]
--pgsql_variant=STRING Use this PostgreSQL variant when running with the PostgreSQL driver. The only currently supported variant is 'redshift'. When enabled, create_secondary is automatically disabled, and delete_inserts is set to 0
--point_selects=N Number of point SELECT queries per transaction [10]
--range_selects[=on|off] Enable/disable all range SELECT queries [on]
--range_size=N Range size for range SELECT queries [100]
--reconnect=N Reconnect after every N events. The default (0) is to not reconnect [0]
--secondary[=on|off] Use a secondary index in place of the PRIMARY KEY [off]
--simple_ranges=N Number of simple range SELECT queries per transaction [1]
--skip_trx[=on|off] Don't start explicit transactions and execute all queries in the AUTOCOMMIT mode [off]
--sum_ranges=N Number of SELECT SUM() queries per transaction [1]
--table_size=N Number of rows per table [10000]
--tables=N Number of tables [1]
So, by default each transaction issues 14 reads (10 point selects + 4 range selects) against 4 writes (1 index update, 1 non-index update, and 1 delete/insert pair), and you can change the read/write ratio simply by changing these option values. For example:
sysbench oltp_read_write.lua --range_selects=off --point_selects=100 xxx
Then the read/write ratio is 100:4, since disabling range_selects leaves only the point selects on the read side.
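For the 95%:5% scenario specifically, one way (a sketch; the exact numbers depend on which statements you count as reads and writes) is to keep the default 4 write statements per transaction and raise the point selects to 76, since 76/(76+4) = 95%:
sysbench oltp_read_write.lua --range_selects=off --point_selects=76 xxx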

Related

Import massive table from Oracle to PostgreSQL with oracle-fdw return ORA-01406

I'm working on a project to transfer data from an Oracle database to a PostgreSQL database to build a data warehouse with bash & SQL scripts. To access the Oracle database, I use the PostgreSQL extension oracle-fdw.
One of my scripts imports data from a massive table (~100,000,000 new rows/day). This table is partitioned and each partition contains 1 day of data. The query I use to import data looks like this:
INSERT INTO postgre_target_table (some_fields)
SELECT some_aggregated_fields -- (~150 fields)
FROM oracle_source_table
WHERE partition_id = :v_partition_id AND some_others_filters
GROUP BY primary_key;
On the DEV server, the query works fine (there is much less data on this server) but in PREPROD it returns the error ORA-01406: fetched column value was truncated.
In some posts, people say that the output fields may be too small, but even if I send a simple SELECT query without INSERT or GROUP BY, I get the same error.
Another idea I found in another post is to create an Oracle-side view, but my query uses multiple parameters that I cannot use in a view.
The last idea I found is to create an Oracle stored procedure that fills a table with aggregated data and then import data from that table, but the Oracle database is critical and my customer prefers to avoid adding more data to it.
Now I'm starting to think there's no solution, which is not good...
PostgreSQL version: 12.4 / Oracle version: 11.2
UPDATE
It seems my problem is more complicated than I thought.
After applying the modification given by Laurenz Albe, the query runs correctly in pgAdmin, but the problem still appears when I use the psql command.
Moreover, another query seems to have the same problem. This other query does not use the same source table as the first one; it uses 4 joined tables without any partitioning. The common point between these queries is their structure.
The detail I omitted in the original post is that the purpose of both queries is to pivot a table. They look like this:
SELECT osr.id,
       MIN(CASE osr.category WHEN 123 THEN 1 END) AS field1,
       MIN(CASE osr.category WHEN 264 THEN 1 END) AS field2,
       MIN(CASE osr.category WHEN 975 THEN 1 END) AS field3,
       ...
FROM oracle_source_table osr
WHERE osr.category IN (123, 264, 975, ...)
GROUP BY osr.id;
Now that I have detailed what the queries look like, here are some results I got with the second one without changing the value of max_long (this query is lighter than the first one):
Sometimes it works (~10%) and sometimes it fails (~90%) in pgAdmin, but it never works with the psql command.
If I delete the WHERE clause, it always works.
I don't understand why deleting the WHERE clause changes anything; the field used in this clause is a NUMBER(6, 0) between 0 and 2500, and it is still used in the SELECT clause... Oh, and in the 4 Oracle tables used by this query there is no LONG datatype, only NUMBER.
Among the 20 queries I have, only these two have a problem; their structure is similar and I don't believe in coincidences.
Don't despair!
Set the max_long option on the foreign table big enough that all your oversized data fit.
The documentation has the details:
max_long (optional, defaults to "32767")
The maximal length of any LONG, LONG RAW and XMLTYPE columns in the Oracle table. Possible values are integers between 1 and 1073741823 (the maximal size of a bytea in PostgreSQL). This amount of memory will be allocated at least twice, so large values will consume a lot of memory.
If max_long is less than the length of the longest value retrieved, you will receive the error message
ORA-01406: fetched column value was truncated
Example:
ALTER FOREIGN TABLE my_tab OPTIONS (ADD max_long '1000000');
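If max_long had already been set on the foreign table, the standard PostgreSQL option syntax uses SET instead of ADD (same table name as in the example above):
ALTER FOREIGN TABLE my_tab OPTIONS (SET max_long '1000000');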

ODI 12C - Business Key auto increment

Using ODI 12c, I have a dimension with a composite primary key made of 2 columns. I just want to implement a business key with auto-increment.
NOTE: I don't need SCD behavior.
I did this:
1 - Create a sequence in the database schema (with no cache)
2 - Call it in the needed mapping:
<%=odiRef.getObjectName("L", "SEQ_NAME", "D")%>.nextval
3 - Set the mapping to be:
Active for inserts only, uncheck the NOT NULL condition, EXECUTE ON HINT: TARGET
4 - Set the CKM to: CKM SQL
5 - Run
You can create a sequence in your project or in the global objects. There are three types:
Standard sequence, where the last value is stored in the repository and ODI increments it each time it retrieves a value.
Specific sequence, same as above except that you choose the table and schema in which the value is stored.
Native sequence, which just uses an underlying database sequence. This is the most popular choice as it is a lot better for performance.
You can then call it using :<SEQUENCE_NAME>_NEXTVAL. For a native sequence it will get a new value for each row. For a standard or specific sequence it will give a new value for each row processed by the agent. So if you do row-by-row operations it will give a new value for each row, but if you do a batch operation it will use the same value everywhere.
You can also call it using #<SEQUENCE_NAME>_NEXTVAL, but this one will be substituted only once, before the SQL is pushed to the database, so all rows will get the same value.
If you use a native sequence on an Oracle database, you will have to set the Execute On Hint to Target so the sequence call ends up in the outermost SELECT.
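As an illustration (all names here are hypothetical), a native sequence would be backed by a plain database sequence, declared as a Native sequence in ODI pointing at it, and then referenced on the surrogate key column with the :<SEQUENCE_NAME>_NEXTVAL syntax, executed on the target:
-- Hypothetical database sequence backing an ODI native sequence named SEQ_CUSTOMER_SK
CREATE SEQUENCE dwh.seq_customer_sk START WITH 1 INCREMENT BY 1 NOCACHE;
-- Mapping expression on the target column (Execute on Hint: Target):
--   :SEQ_CUSTOMER_SK_NEXTVAL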

How much data is considered "too large" for a Hive MAPJOIN job?

EDIT: added more file size details, and some other session information.
I have a seemingly straightforward Hive JOIN query that surprisingly requires several hours to run.
SELECT a.value1, a.value2, b.value
FROM a
JOIN b ON a.key = b.key
WHERE a.keyPart BETWEEN b.startKeyPart AND b.endKeyPart;
I'm trying to determine if the execution time is normal for my dataset and AWS hardware selection, or if I am simply trying to JOIN too much data.
Table A: ~2.2 million rows, 12MB compressed, 81MB raw, 4 files.
Table B: ~245 thousand rows, 6.7MB compressed, 14MB raw, one file.
AWS: emr-4.3.0, running on about 5 m3.2xlarge EC2 instances.
Records from A always match one or more records in B, so logically I see that at most 500 billion rows are generated before they are pruned by the WHERE clause.
4 mappers are allocated for the job, which completes in 6 hours. Is this normal for this type of query and configuration? If not, what should I do to improve it?
I've partitioned B on the JOIN key, which yields 5 partitions, but haven't noticed a significant improvement.
Also, the logs show that the Hive optimizer starts a local map join task, presumably to cache or stream the smaller table:
2016-02-07 02:14:13 Starting to launch local task to process map join; maximum memory = 932184064
2016-02-07 02:14:16 Dump the side-table for tag: 1 with group count: 5 into file: file:/mnt/var/lib/hive/tmp/local-hadoop/hive_2016-02-07_02-14-08_435_7052168836302267808-1/-local-10003/HashTable-Stage-4/MapJoin-mapfile01--.hashtable
2016-02-07 02:14:17 Uploaded 1 File to: file:/mnt/var/lib/hive/tmp/local-hadoop/hive_2016-02-07_02-14-08_435_7052168836302267808-1/-local-10003/HashTable-Stage-4/MapJoin-mapfile01--.hashtable (12059634 bytes)
2016-02-07 02:14:17 End of local task; Time Taken: 3.71 sec.
What is causing this job to run slowly? The data set doesn't appear too large, and the "small-table" size is well under the "small-table" limit of 25MB that triggers the disabling of the MAPJOIN optimization.
A dump of the EXPLAIN output is copied on PasteBin for reference.
My session enables compression for output and intermediate storage. Could this be the culprit?
SET hive.exec.compress.output=true;
SET hive.exec.compress.intermediate=true;
SET mapred.output.compress=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
SET io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;
SET io.seqfile.compression.type=BLOCK;
My solution to this problem is to express the JOIN predicate entirely within the JOIN ON clause, as this is the most efficient way to execute a JOIN in Hive. As for why the original query was slow, I believe that the mappers simply need that much time to scan the intermediate data set row by row, 100+ billion times.
Due to Hive only supporting equality expressions in the JOIN ON clause and rejecting function calls that use both table aliases as parameters, there is no way to rewrite the original query's BETWEEN clause as an algebraic expression. For example, the following expression is illegal.
-- Only handles exclusive BETWEEN
JOIN b ON a.key = b.key
AND sign(a.keyPart - b.startKeyPart) = 1.0 -- keyPart > startKeyPart
AND sign(a.keyPart - b.endKeyPart) = -1.0 -- keyPart < endKeyPart
I ultimately modified my source data to include every value between startKeyPart and endKeyPart in a Hive ARRAY<BIGINT> data type.
CREATE TABLE LookupTable (
  key BIGINT,
  startKeyPart BIGINT,
  endKeyPart BIGINT,
  keyParts ARRAY<BIGINT>
);
Alternatively, I could have generated this value inline within my queries using a custom Java method; the LongStream.rangeClosed() method is only available in Java 8, which is not part of Hive 1.0.0 in AWS emr-4.3.0.
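For reference, here is one way the keyParts array could be populated in HiveQL without a Java UDF (a sketch; the staging table source_ranges and its columns are assumptions): posexplode over a synthetic 0..N index built with split(space(N), ' '), then collect the expanded values back into an array.
-- Sketch: expand each start/end pair into one row per value, then collect into an array.
-- source_ranges is a hypothetical staging table holding (key, startKeyPart, endKeyPart).
INSERT OVERWRITE TABLE LookupTable
SELECT key,
       startKeyPart,
       endKeyPart,
       collect_list(startKeyPart + pos) AS keyParts
FROM source_ranges
LATERAL VIEW posexplode(split(space(CAST(endKeyPart - startKeyPart AS INT)), ' ')) r AS pos, x
GROUP BY key, startKeyPart, endKeyPart;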
Now that I have the entire key space in an array, I can transform the array to a table using LATERAL VIEW and explode(), rewriting the JOIN as follows.
WITH b AS (
  SELECT key, keyPart, value
  FROM LookupTable
  LATERAL VIEW explode(keyParts) keyPartsTable AS keyPart
)
SELECT a.value1, a.value2, b.value
FROM a
JOIN b ON a.key = b.key AND a.keyPart = b.keyPart;
The end result is that the above query takes approximately 3 minutes to complete, when compared with the original 6 hours on the same hardware configuration.

Why does the optimizer choose the higher cost execution plan?

This is a recurring problem for me. I have statements that work well for a while, and then the optimizer decides to choose another execution plan. This even happens when I query for exactly one (composite) primary key.
When I look up the execution plan in dba_hist_sql_plan, it shows me costs of 20 for the query using the primary key index and costs of 270 for the query doing a full table scan.
plan_hash_value  Id  Operation         Options               Cost  Search_Columns
2550672280        0  SELECT STATEMENT                          20
2550672280        1  PARTITION HASH    SINGLE                  20
2550672280        2  TABLE ACCESS      BY LOCAL INDEX ROWID    20
2550672280        3  INDEX             RANGE SCAN              19    1
3908080950        0  SELECT STATEMENT                         270
3908080950        1  PARTITION HASH    SINGLE                 270
3908080950        2  TABLE ACCESS      FULL                   270
I already noticed that the optimizer only uses the first column of the primary key index and then does a range scan. But my real question is: why does the optimizer choose the higher cost execution plan? It's not that both execution plans are used at the same time; I notice a switch within one snapshot and then it stays like that for several hours/days. So it can't be an issue of bind peeking.
Our current solution is that I call our DBA and he flushes the statement cache, but this is not really sustainable.
EDIT:
The SQL looks something like this: select * from X where X.id1 = ? and X.id2 = ? and X.id3 = ?
with (id1,id2,id3) being the composite primary key (with a unique index) on the table.
Maybe it's related to a bug in Oracle 11g.
Bug 18377553 : POOR CARDINALITY ESTIMATE WITH HISTOGRAMS AND VALUES > 32 BYTES
When your data looks like this:
AAAAAAAAAAAAAAAAAAAAmyvalue
AAAAAAAAAAAAAAAAAAAAsomeohtervalue
AAAAAAAAAAAAAAAAAAAAandsoon
B1234
Histograms do not work well.
The solution is to disable histograms on the primary key columns, and everything will start working smoothly.
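A possible way to do that (a sketch; the schema, table, and column names are placeholders, and you should verify the combined method_opt syntax on your version) is to re-gather statistics while requesting a single bucket, i.e. no histogram, on the primary key columns:
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname    => 'MYSCHEMA',
    tabname    => 'X',
    method_opt => 'FOR ALL COLUMNS SIZE AUTO FOR COLUMNS SIZE 1 ID1 ID2 ID3'
  );
END;
/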
Most likely the clustering factor and blevel of the index are very high. Check the blevel by querying dba_indexes; if the blevel is greater than 3, try rebuilding the index.
Also check whether the index created for the primary key is unique or not. According to the plan it is using a range scan instead of a unique scan, so most likely the index is not unique.
Apparently the optimizer doesn't correctly display costs for type conversions. The root cause of this problem was an incorrect type mapping for a date value. While the column in the database is of type DATE, the JDBC type was incorrectly java.sql.Timestamp. To compare a DATE column with a Timestamp search parameter, every value in the table has to be converted to Timestamp first, which is additional cost and renders the index unusable.
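As an illustration (column and bind names are made up): with a Timestamp bind, Oracle has to convert the DATE column for every row, so the predicate can no longer use the index. Besides fixing the JDBC type mapping, one SQL-side workaround is to convert the bind instead of the column:
SELECT *
FROM   x
WHERE  x.created_dt = CAST(:ts AS DATE);  -- cast the bind, not the column, so an index on created_dt stays usable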

after alter system flush shared_pool low performance Oracle

We did some refactoring and replaced 2 similar queries with a single parameterized query:
a.isGood = :1
After that, the query that used this parameter with value 'Y' ran longer than usual (almost as long as with value 'N'). We ran the ALTER SYSTEM FLUSH SHARED_POOL command and the query with parameter 'Y' completed fast again (as before the refactoring), while the query with parameter 'N' now hangs for a long time.
As you might guess, there are many more rows in the database with value 'N' than with 'Y'.
Oracle 10g
Why did this happen?
I assume that you have an index on that column; otherwise the performance would be the same regardless of the Y/N value. I have seen this happen quite a bit on 10g+ due to Oracle's bind peeking combined with histograms on columns with skewed data distribution. The histograms get created automatically when one gathers table statistics using the parameter method_opt with 'FOR ALL COLUMNS SIZE AUTO' (among other values). Oracle optimizes the query for the value in the bind variables provided in the very first execution of that query. If you run the query with Y the first time, Oracle might want to use an index instead of a full table scan, since Y will return a small number of rows. The next time you run the query with N, Oracle will reuse that first execution plan, which happens to be a poor choice for N, since it will return the vast majority of rows.
The execution plans are cached in the SGA. Once you flush it, you get a brand new execution plan the very first time the query runs again.
My suggestion is:
Obtain the explain plan of both original queries (one with a hardcoded Y and one with a hardcoded N). Investigate whether the two plans use different indexes or one has a much higher cost than the other. I have the feeling that one uses a full table scan and the other uses an index. The first one should be faster for N and the second should be faster for Y.
Try to remove the statistics on the table and see if it makes a difference on the query that has the bind variable. Later you need to gather statistics again for the table or other queries on that table might suffer.
You can also gather statistics for that one table using method_opt => FOR ALL COLUMNS SIZE 1. That will keep the statistics without the histograms on any columns of that table.
A bitmap index on this column might fix the issue as well; regular indexes on a column that has only two possible values (Y and N) are not exactly very efficient.
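For the bitmap index suggestion, the DDL would be along these lines (the index name is a placeholder); keep in mind that bitmap indexes are a poor fit for tables with heavy concurrent DML:
CREATE BITMAP INDEX x_isgood_bix ON x (isGood);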
If column isGood has 99,000 'N' values and 1,000 'Y' values and you run with the condition isGood = 'Y', then it may be appropriate to use an index to find the results: you are returning 1% of the rows. If you run the query with the condition isGood = 'N', a full table scan would be more appropriate since you are returning most of the table anyway. If you were to use an index for the N condition, you would be doing an extra index lookup for every data item lookup.
Although the general rule is that bind parameters are good, they can be problematic in cases like this where two genuinely different plans are required for the query. With the bind parameter scenario:
SELECT * FROM x WHERE isGood = :1
The statement will be parsed and a plan computed and saved in the sql cache. The same plan will be used for both query scenarios which is not desirable. But:
SELECT * FROM x WHERE isGood = 'Y'
SELECT * FROM x WHERE isGood = 'N'
will result in two plans being stored in the sql cache, hopefully each with the appropriate plan for the query. Version 11g avoids this problem with adaptive cursor sharing, which can use different plans for different bind variable values.
You need to look at your plans (EXPLAIN PLAN) to see what is happening in your case. Flush the cache, try one method, examine the plan; try the other, examine the plan. It might give you an idea what is happening in your case. There are a bunch of other topics you might follow up on that may help, for example:
using a hint to force the use of an index (see the sketch after this list)
cursor_sharing parameter
histograms on statistics
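For the first item, a quick way to compare is to hint the index and look at both plans (the index name here is a placeholder):
-- Sketch: compare the hinted plan against the unhinted one
EXPLAIN PLAN FOR
  SELECT /*+ INDEX(x idx_x_isgood) */ * FROM x WHERE isGood = :1;
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);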
