MemSQL: Load data with "skip duplicate key" option is extremely slow

I am evaluating the loading performance of SingleStore 7.6.10.
I tested two ways of loading; both are important in real-world practice:
loading that skips duplicated primary keys:
load data local infile '/opt/orders.tbl' skip duplicate key errors into table ORDERS fields terminated by '|' lines terminated by '|\n' max_errors 0;
loading that replaces duplicated primary keys with the latest records:
load data local infile '/opt/orders.tbl' replace into table orders_sf1_col columns terminated by '|';
Before running the tests, I expected both methods to have similar load times, because both need to probe the primary key to find duplicates. If anything, the REPLACE method should take more time, because it needs to delete the current record and insert the latest one in its place.
But to my surprise, the SKIP load ran extremely slowly, taking almost 8 minutes to load a 163 MB data file, while the REPLACE load of the same file into the same table finished in less than 15 seconds.
Both tests ran in the same environment (3 VMs) with the same data file and the same target table. To simulate duplicate-key conflicts, I ran two consecutive loads into an initially empty table and measured only the second one.
The question is: why does loading with SKIP DUPLICATE KEY ERRORS perform so slowly, and is there a better way to achieve the same effect?
The DDL is here:
CREATE TABLE `orders_sf1_col` (
`O_ORDERKEY` int(11) NOT NULL,
`O_CUSTKEY` int(11) NOT NULL,
`O_ORDERSTATUS` char(1) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
`O_TOTALPRICE` decimal(15,2) NOT NULL,
`O_ORDERDATE` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00.000000',
`O_ORDERPRIORITY` varchar(15) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
`O_CLERK` varchar(15) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
`O_SHIPPRIORITY` int(11) NOT NULL,
`O_COMMENT` varchar(79) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
`O_NOP` varchar(79) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
UNIQUE KEY `PRIMARY` (`O_ORDERKEY`) USING HASH,
KEY `ORDERS_FK1` (`O_CUSTKEY`) USING HASH,
KEY `ORDERS_DT_IDX` (`O_ORDERDATE`) USING HASH,
SHARD KEY `__SHARDKEY` (`O_ORDERKEY`) USING CLUSTERED COLUMNSTORE
) AUTOSTATS_CARDINALITY_MODE=INCREMENTAL AUTOSTATS_HISTOGRAM_MODE=CREATE AUTOSTATS_SAMPLING=ON SQL_MODE='STRICT_ALL_TABLES'
Thanks

SKIP is the more resource-intensive option because it relies on a clustered index scan, which is why it takes so much longer.
REPLACE, on the other hand, uses fewer server resources because it performs a clustered index seek, which reduces execution time noticeably.
The latest SingleStore version (7.8) shows better results for this case; please go through the official documentation.
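If the goal is simply to skip duplicates faster, one workaround worth trying (an untested sketch; orders_staging is a hypothetical table name, and it assumes INSERT IGNORE is supported for your table type and version, following MySQL semantics) is to load into a keyless staging table first and then de-duplicate in a single INSERT:

CREATE TABLE orders_staging AS SELECT * FROM orders_sf1_col WHERE 1 = 0; -- copies columns, not the unique key
LOAD DATA LOCAL INFILE '/opt/orders.tbl' INTO TABLE orders_staging FIELDS TERMINATED BY '|' LINES TERMINATED BY '|\n';
INSERT IGNORE INTO orders_sf1_col SELECT * FROM orders_staging; -- rows with an existing O_ORDERKEY are skipped
DROP TABLE orders_staging;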

Related

Can shard key in MemSQL have NULLs?

What are the rules regarding the shard key and the key for a clustered columnstore?
I need to use a column as the shard key and also for the clustered columnstore, but it may contain NULLs.
What will be the impact of keeping a nullable column as the shard key?
I have already tested the data load using this column and, at a high level, everything looks good for the first batch, but will it break anything while writing or reading down the line?
CREATE TABLE test (
name varchar(25) DEFAULT NULL,
ID int(11) DEFAULT NULL,
update_date date DEFAULT NULL,
SHARD KEY (update_date) USING CLUSTERED COLUMNSTORE
)
NULL values are allowed in shard keys and columnstore keys, and they behave like any other value - so if you define a shard key on a column containing NULLs, all NULL rows will be placed on the same partition.
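As a quick illustration against the test table from the question (hypothetical rows), all NULL shard-key values hash identically and therefore land on a single partition:

INSERT INTO test (name, ID, update_date) VALUES
  ('alice', 1, NULL),          -- same partition as 'bob'
  ('bob',   2, NULL),
  ('carol', 3, '2020-01-01');  -- placed according to the date value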

Keycloak starts very slow when offline_user_session table has many records

We have around a million users in our app. We use a Keycloak instance (https://hub.docker.com/r/jboss/keycloak/) with the standard configuration.
We also use offline tokens, which this problem is somehow connected to.
The more records the offline_user_session table has, the longer it takes to start the Keycloak instance:
if it has 0 records, a start takes about 30 seconds;
when it has 800,000 sessions, it takes 8 minutes to start;
and when it has around 1,000,000 sessions, it can take 30 minutes or more to start.
I tried to find anything on the internet and looked in the official documentation, but still no results.
Analyzing the offline user session table reveals a problem related to good practices in defining this table.
+----------------------+--------------+------+-----+---------+-------+
| Field                | Type         | Null | Key | Default | Extra |
+----------------------+--------------+------+-----+---------+-------+
| USER_SESSION_ID      | varchar(36)  | NO   | PRI | NULL    |       |
| USER_ID              | varchar(255) | YES  |     |         |       |
| REALM_ID             | varchar(36)  | NO   |     |         |       |
| CREATED_ON           | int(11)      | NO   | MUL |         |       |
| OFFLINE_FLAG         | varchar(4)   | NO   | PRI |         |       |
| DATA                 | longtext     | YES  |     |         |       |
| LAST_SESSION_REFRESH | int(11)      | NO   |     | 0       |       |
+----------------------+--------------+------+-----+---------+-------+
At the database layer we have a design problem: two varchar columns compose the primary key. This situation is the sole cause of the problem presented, which only becomes evident in scenarios with many records.
The solution is to swap the primary key for a large integer data type and keep the columns (USER_SESSION_ID, OFFLINE_FLAG) as a unique index, as in the revised layout below (a sketch of the change follows it). However, this also requires configuration adjustments at the application layer, which should be reviewed with the solution provider.
+----------------------+--------------+------+-----+---------+-------+
| Field                | Type         | Null | Key | Default | Extra |
+----------------------+--------------+------+-----+---------+-------+
| ID                   | BIGINT       | NO   | PRI |         |       |
| USER_SESSION_ID      | varchar(36)  | NO   | UNI | NULL    |       |
| USER_ID              | varchar(255) | YES  |     |         |       |
| REALM_ID             | varchar(36)  | NO   |     |         |       |
| CREATED_ON           | int(11)      | NO   | MUL |         |       |
| OFFLINE_FLAG         | varchar(4)   | NO   | UNI |         |       |
| DATA                 | longtext     | YES  |     |         |       |
| LAST_SESSION_REFRESH | int(11)      | NO   |     | 0       |       |
+----------------------+--------------+------+-----+---------+-------+
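A minimal sketch of that change, assuming a MySQL-compatible backend (the index name is made up, and hand-editing Keycloak's schema is unsupported, so it must be coordinated with the solution provider):

ALTER TABLE offline_user_session
  DROP PRIMARY KEY,
  ADD COLUMN ID BIGINT NOT NULL AUTO_INCREMENT FIRST,
  ADD PRIMARY KEY (ID),
  ADD UNIQUE KEY uq_offline_session (USER_SESSION_ID, OFFLINE_FLAG);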

Unique constraint without index

Let's say I have a large table.
This table does not need to be queried; I just want to keep the data in it for a while.
I want to prevent duplicate rows in the table, so I want to add a unique constraint (or PK) on the table.
But the auto-created unique index is really unnecessary.
I don't need it; it just wastes disk space and requires maintenance (not to mention the long time it takes to create).
Is there a way to create a unique constraint without an index (any index - unique or nonunique)?
Thank you.
No, you can't have a UNIQUE constraint in Oracle without a corresponding index. The index is created automatically when the constraint is added, and any attempt to drop the index results in the error
ORA-02429: cannot drop index used for enforcement of unique/primary key
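You can see this for yourself with a throwaway table (hypothetical names):

CREATE TABLE t (x NUMBER);
ALTER TABLE t ADD CONSTRAINT t_uq UNIQUE (x); -- Oracle silently creates unique index T_UQ
DROP INDEX t_uq;
-- ORA-02429: cannot drop index used for enforcement of unique/primary key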
Best of luck.
EDIT
But you say "Let's say I have a large table". So how many rows are we talking about here? Look, 1 TB SSDs are under $100. Quad-core laptops are under $400. If you're trying to minimize storage use or CPU burn by writing a bunch of code with minimal applicability to "save money" or "save time", my suggestion is that you're wasting both time and money. I repeat - ONE TERABYTE of storage costs the same as ONE HOUR of programmer time. A BRAND SPANKING NEW COMPUTER costs the same as FOUR LOUSY HOURS of programmer time. You are far, far better off doing whatever you can to minimize CODING TIME rather than the traditional optimization targets of CPU time or disk space. Thus, I submit that the UNIQUE index is the low-cost solution.
But the auto-created unique index is really unnecessary.
In fact, UNIQUEness in an Oracle Database is enforced/guaranteed via an INDEX. That's why your primary key constraints come with a UNIQUE INDEX.
Per the Docs
UNIQUE Key Constraints and Indexes
Oracle enforces unique integrity constraints with indexes.
Maybe an index-organized table is what you need?
But strictly speaking, an index-organized table is a table stored in the structure of an index - one could say there is an index alone without a table, while your requirement is a table without an index, so it's the opposite :)
CREATE TABLE some_name
(
  col1 NUMBER(10) NOT NULL,
  col2 NUMBER(10) NOT NULL,
  col3 VARCHAR2(50) NOT NULL,
  col4 VARCHAR2(50) NOT NULL,
  CONSTRAINT pk_locations PRIMARY KEY (col1, col2)
)
ORGANIZATION INDEX -- the whole table is stored in the B-tree of its primary key

Postgres primary key 'less than' operation is slow

Consider the following table
CREATE TABLE COMPANY(
ID BIGINT PRIMARY KEY NOT NULL,
NAME TEXT NOT NULL,
AGE INT NOT NULL,
ADDRESS CHAR(50),
SALARY REAL
);
If we have 100 million rows of random data in this table:
Select age from company where id=2855265
executes in less than a millisecond.
Select age from company where id<353
returns fewer than 50 rows and also executes in less than a millisecond.
Both queries use the index.
But the following query uses a full table scan and takes 3 seconds:
Select age from company where id<2855265
even though it returns fewer than 500 rows.
How can I speed up a query that filters the primary key with a less-than condition?
Performance
The predicate id < 2855265 potentially returns a large percentage of rows in the table. Unless Postgres has information in table statistics to expect only around 500 rows, it might switch from an index scan to a bitmap index scan or even a sequential scan. Explanation:
Postgres not using index when index scan is much better option
We would need to see the output from EXPLAIN (ANALYZE, BUFFERS) for your queries.
When you repeat the query, do you get the same performance? There may be caching effects.
Either way, 3 seconds is way too slow for 500 rows. Postgres might be working with outdated or inexact table statistics. Or there may be issues with your server configuration (not enough resources). Or there can be several other less common reasons, including hardware issues ...
If VACUUM ANALYZE did not help, VACUUM FULL ANALYZE might. It effectively rewrites the whole table and all indexes in pristine condition. Takes an exclusive lock on the table and might conflict with concurrent access!
I would also consider increasing the statistics target for the id column. Instructions:
Keep PostgreSQL from sometimes choosing a bad query plan
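Put together, the diagnostic and maintenance steps might look like this (the statistics target of 1000 is an arbitrary illustration):

EXPLAIN (ANALYZE, BUFFERS) SELECT age FROM company WHERE id < 2855265;
VACUUM ANALYZE company;                                   -- refresh planner statistics
ALTER TABLE company ALTER COLUMN id SET STATISTICS 1000;  -- more detailed stats for "id"
ANALYZE company;                                          -- re-collect at the higher target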
Table definition?
Whatever else you do, there seem to be various problems with your table definition:
CREATE TABLE COMPANY(
ID BIGINT PRIMARY KEY NOT NULL, -- int is probably enough. "id" is a terrible column name
NAME TEXT NOT NULL, -- "name" is a terrible column name
AGE INT NOT NULL, -- typically bad idea to store age, store birthday instead
ADDRESS CHAR(50), -- never use char(n)!
SALARY REAL -- why would a company have a salary? never store money as real
);
You probably want something like this instead:
CREATE TABLE employee(
  employee_id serial PRIMARY KEY
, company_id int NOT NULL -- REFERENCES company(company_id)?
, birthday date NOT NULL
, employee_name text NOT NULL
, address varchar(50) -- or just text
, salary int -- store the amount as *cents*
);
Related:
How to implement a many-to-many relationship in PostgreSQL?
Any downsides of using data type "text" for storing strings?
You will need to run VACUUM ANALYZE company; to update the planner statistics.

Do I need a primary key with an Oracle identity column?

I am using Oracle 12c and I have an IDENTITY column set as GENERATED ALWAYS.
CREATE TABLE Customers
(
id NUMBER GENERATED ALWAYS AS IDENTITY,
customerName VARCHAR2(30) NULL,
CONSTRAINT "CUSTOMER_ID_PK" PRIMARY KEY ("ID")
);
Since the ID is populated automatically from a sequence, it will always be unique.
Do I need a PK on the ID column, and if yes, will it impact performance?
Would an index produce the same result with better performance on INSERT?
No, you don't necessarily need a primary key, but you should always give the optimiser as much information about your data as possible - including a unique constraint whenever possible.
In the case of a surrogate key (like your ID), it's almost always appropriate to declare it as a primary key, since it's the most likely candidate for referential constraints.
You could use an ordinary index on ID, and lookup performance will be comparable depending on data volume - but there is virtually no good reason in this case to use a non-unique index instead of a unique one, and no good reason to avoid the constraint, which requires an index anyway.
Yes, you should (with rare exceptions) always have a uniqueness constraint on an ID column if it is indeed (and should be) unique, regardless of how it is populated - whether the value is provided by your application code or via an IDENTITY column.
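To make the trade-off concrete, here is a sketch of both variants (table and index names are made up):

-- Recommended: identity column declared as the primary key.
CREATE TABLE customers_pk (
  id   NUMBER GENERATED ALWAYS AS IDENTITY,
  name VARCHAR2(30),
  CONSTRAINT customers_pk_id PRIMARY KEY (id)
);

-- Legal, but weaker: a plain non-unique index. Lookups still use the index,
-- but the optimiser no longer knows the column is unique.
CREATE TABLE customers_idx (
  id   NUMBER GENERATED ALWAYS AS IDENTITY,
  name VARCHAR2(30)
);
CREATE INDEX customers_idx_id ON customers_idx (id);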
