I have an SQLite database used to store information about backup jobs. Each run, it grows by approximately 25 MB as a result of adding around 32,000 entries to a particular table.
This table is a "map table" used to link certain info to records in another table... and it has a primary key (autoincrement int) that I don't use.
SQLite will reserve 1, 2, 3, 4, 6, or 8 bytes for an INTEGER column depending on the magnitude of its value. This table only has 3 additional columns, also of INT type.
I've added indexes to the database on the columns that I use as filters (WHERE) in my queries.
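For illustration, the table and its indexes might look roughly like this (a minimal sketch; all names below are assumptions, not the actual schema):

CREATE TABLE file_map (
    id      INTEGER PRIMARY KEY AUTOINCREMENT,  -- the unused surrogate key
    job_id  INTEGER,                            -- hypothetical link columns
    file_id INTEGER,
    info_id INTEGER
);
CREATE INDEX idx_file_map_job  ON file_map (job_id);
CREATE INDEX idx_file_map_file ON file_map (file_id);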
In the presence of indexes, etc. and in the situation described, do primary keys have any useful benefit in terms of performance?
Note: Performance is very, very important to this project - but not if 10ms saved on a 32,000 entry job means an additional 10MB of data!
A primary key index is used to look up a row for a given primary key. It is also used to ensure that the primary key values are unique.
If you search your data using other columns, the primary key index will not be used, and as such will yield no performance benefit. Its mere existence should not have a negative performance impact either, though.
An unnecessary index wastes disk space, and makes INSERT and UPDATE statements execute slower. It should have no negative impact on query performance.
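A quick way to confirm which index a query actually uses is EXPLAIN QUERY PLAN; the names below are the hypothetical ones from the sketch in the question:

EXPLAIN QUERY PLAN
SELECT * FROM file_map WHERE job_id = 42;
-- The plan should report a search using idx_file_map_job, not the primary key index.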
If you really don't use this id, why don't you drop the column and its primary key? The only reason to keep an unused primary key id column alive is to make it possible to create a master-detail relation with another table.
Another possibility is to keep the column but drop the primary key. That means the application has to take care of providing a unique id with every insert statement, and before and after each batch operation you have to check whether the column is still unique. This doesn't work in, for instance, MySQL and Oracle because of concurrency issues, but it does work in SQLite.
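SQLite cannot drop a primary key column with a simple ALTER TABLE, so dropping it means rebuilding the table. A minimal sketch, assuming the hypothetical file_map schema from the question:

BEGIN;
CREATE TABLE file_map_new (
    job_id  INTEGER,
    file_id INTEGER,
    info_id INTEGER
);
INSERT INTO file_map_new (job_id, file_id, info_id)
    SELECT job_id, file_id, info_id FROM file_map;
DROP TABLE file_map;
ALTER TABLE file_map_new RENAME TO file_map;
CREATE INDEX idx_file_map_job  ON file_map (job_id);
CREATE INDEX idx_file_map_file ON file_map (file_id);
COMMIT;

Note that the rebuilt table still has SQLite's implicit rowid, so this mainly removes the AUTOINCREMENT bookkeeping rather than all per-row key storage.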
Problem statement:
There is an address table in Oracle that has relationships with multiple tables such as subscriber, member, etc.
The current design is such that when there is any change in an associated table, the record version is incremented throughout all tables.
So a new record is added to the address table even if the same address is already present, resulting in a large number of duplicate copies.
We need to identify and remove the duplicate records and update the foreign keys in the associated tables, while making sure this doesn't impact the running application.
Tried solution:
We have written a script for the cleanup logic, in which a unique hash is generated for every address. If the calculated hash is already present, the address is a duplicate; we then merge the duplicates into a single address record and update the foreign keys in the associated tables.
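As an illustration only, the duplicate detection could look roughly like the following in Oracle 12c+ (the column names and the STANDARD_HASH input are assumptions, not the actual script):

-- Group addresses by a hash of their normalized fields (hypothetical columns).
SELECT STANDARD_HASH(street || '|' || city || '|' || postal_code, 'SHA256') AS addr_hash,
       MIN(address_id) AS keep_id,
       COUNT(*)        AS copies
FROM   address
GROUP  BY STANDARD_HASH(street || '|' || city || '|' || postal_code, 'SHA256')
HAVING COUNT(*) > 1;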
But the problem is that there are around 300 billion records in the address table, so this cleanup process is taking a lot of time and will take several days to complete.
We have tried adding an index on the hash column, but the process is still taking a long time.
We have also updated the insertion/query logic to use addresses in the new structure (using the hash, and without the version), in order to take care of incoming requests in production.
We are planning to do the processing in chunks, but it will be a very long, ongoing activity.
Questions:
We would like to know if any further improvements can be made to the above approach.
Will distributed processing help here (maybe using Hadoop Spark/Hive/MR, etc.)?
Is there some sort of tool that can be used here?
Suggestion 1
Use the built-in parallel delete (parallel DML must be enabled in the session for the hint to take effect):

ALTER SESSION ENABLE PARALLEL DML;
DELETE /*+ PARALLEL(t 8) */ mytable t WHERE ...;
Suggestion 2
Use distributed processing (Hadoop Spark/Hive) - watch out for potential contention on indexes or table blocks. It is recommended to have each process work on a logically isolated subset, e.g.
process 1 - delete mytable t where id between 1000 and 1999
process 2 - delete mytable t where id between 2000 and 2999
...
Suggestion 3
If more than ~30% of the table needs to be deleted, the fastest way is to create an empty table, copy all the required rows into it, drop the original table, rename the new one, and recreate all indexes and constraints. Of course this requires downtime, and it greatly depends on the number of indexes: the more you have, the longer it will take.
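A rough sketch of that approach (table, column, and helper names are placeholders; grants, triggers, and foreign keys would also need to be handled):

-- Copy only the rows to keep; the driving table of survivors is hypothetical.
CREATE TABLE address_clean AS
    SELECT * FROM address a
    WHERE  a.address_id IN (SELECT keep_id FROM addresses_to_keep);

DROP TABLE address;
RENAME address_clean TO address;

-- Recreate indexes and constraints afterwards, e.g.:
CREATE INDEX idx_address_hash ON address (addr_hash);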
P.S. There are no "magic" tools to do this. In the end they all run the same SQL commands that you can.
It is possible to use the Oracle MERGE statement to insert the data if you use clean SQL.
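For example, a hedged sketch of such a MERGE, inserting only addresses whose hash is not already present (all names are assumptions):

MERGE INTO address_clean tgt
USING (SELECT addr_hash, street, city, postal_code FROM address_staging) src
ON (tgt.addr_hash = src.addr_hash)
WHEN NOT MATCHED THEN
    INSERT (addr_hash, street, city, postal_code)
    VALUES (src.addr_hash, src.street, src.city, src.postal_code);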
According to the official manual:
"The AUTOINCREMENT keyword imposes extra CPU, memory, disk space, and disk I/O overhead and should be avoided if not strictly needed. It is usually not needed."
So is it better not to use it? Do you have any benchmarks comparing the implicit rowid against AUTOINCREMENT?
As recommended in the documentation, it is better not to use AUTOINCREMENT unless you need to ensure that the alias of the rowid (a.k.a. the id) is greater than any that have ever been allocated. However, in normal use it's a moot point: even without AUTOINCREMENT, until the rowid reaches 9223372036854775807, each new row gets a higher rowid/id anyway.
If you do reach an id/rowid of 9223372036854775807, then that's it if you have AUTOINCREMENT coded: an SQLITE_FULL exception will be raised. Without AUTOINCREMENT, attempts will instead be made to find an unused id/rowid.
AUTOINCREMENT adds a row (and the table, if required) to sqlite_sequence that records the highest allocated id. The difference is that with AUTOINCREMENT the sqlite_sequence table is consulted, whilst without it it isn't. So if the row with the highest id is deleted, AUTOINCREMENT takes the highest ever allocated id from the sqlite_sequence table (and uses the greater of that or max(rowid)), whereas without AUTOINCREMENT the highest id currently in the table is used (equivalent to max(rowid)).
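To make the two variants concrete (table names are illustrative):

-- id is an alias of the rowid; after deleting the row with the highest id, that id may be reused.
CREATE TABLE t_plain (id INTEGER PRIMARY KEY, payload TEXT);

-- The highest id ever allocated is tracked in sqlite_sequence and is never reused.
CREATE TABLE t_auto (id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT);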
With limited testing, an overhead of 8-12% was found, as per What are the overheads of using AUTOINCREMENT for SQLite on Android?
I have tried sqlite3 autoincrement with Python 3 and SQLAlchemy 1.4.
Before enabling autoincrement on the integer primary key id column, a single insert took less than about 0.1 seconds. After enabling this feature, a single insert took more than 1.5 seconds.
The performance gap is big.
I assume the answer is "no" in this scenario, but I figured I'd ask and see if there was something I was missing:
I have an Oracle table which is partitioned for ease of data loading -- data is loaded into six separate tables and then partition-switched into the main table. The only thing differentiating these loading tables is the source of the data, so each one has a unique datasource value, which is the column used to partition the main table. We occasionally have some ad hoc queries that look at this datasource in the main table, but the standard reports querying this table ignore the column entirely. Nothing inserts/updates/deletes individual records in this table, so there's no concern about updating any indexes.
In this case, is there any reason to use local indexes instead of global ones?
A local index makes a lot of sense - if you use partitioning for performance reasons.
If your queries always contain the partition key, then Oracle will only scan that specific partition (this is known as "partition pruning").
If you then have additional conditions that would benefit from an index lookup, the database only needs to check the local index, which is much smaller than a global index, so the lookup will be faster.
In your case, if you never (or almost never) include the partition key in the queries, you are right that the local index wouldn't be helpful.
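For illustration, the two kinds of index look like this (table and column names are placeholders):

-- Local index: one index segment per partition, pruned together with the table.
CREATE INDEX idx_main_local ON main_table (report_date) LOCAL;

-- Global index: a single index structure spanning all partitions.
CREATE INDEX idx_main_global ON main_table (report_date);

One practical point with partition-exchange loading is that global indexes have to be maintained (e.g. with the UPDATE GLOBAL INDEXES clause) or rebuilt at exchange time, whereas local index partitions are swapped along with the data.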
I'm creating the data model for a time-series application on Cassandra 2.1.3. We will be preserving X amount of data for each user of the system, and I'm wondering what the best approach is to design for this requirement.
Option1:
Use a 'bucket' in the partition key, so data for X period goes into the same row. Something like this:
((id, bucket), timestamp) -> data
I can delete a single row at once at the expense of maintaining this bucket concept. It also limits the range I can query on timestamp, probably resulting in several queries.
Option2:
Store all the data in the same row; deletes are then per column.
(id, timestamp) -> data
Range queries are easy again. But what about performance after many column deletes?
Given that we plan to use TTL to let the data expire, which of the two models would deliver the best performance? Is the tombstone overhead of Option1 << Option2 or will there be a tombstone per column on both models anyway?
I'm trying to avoid to bury myself in the tombstone graveyard.
I think it will all depend on how much data you plan on having for the given partition key you end up choosing, what your TTL is and what queries you are making.
I typically lean towards option #1, especially if your TTL is the same for all writes. In addition, if you are using LeveledCompactionStrategy or DateTieredCompactionStrategy, Cassandra will do a great job keeping data from the same partition in the same SSTable, which will greatly improve read performance.
If you use Option #2, data for the same partition could well be spread across multiple levels (if using LCS) or, in general, across multiple SSTables, which may cause you to read from a lot of SSTables, depending on the nature of your queries. There is also the issue of hotspotting, where you could overload particular Cassandra nodes if you have a really wide partition.
The other benefit of #1 (which you allude to), is that you can easily delete the entire partition, which creates a single tombstone marker which is much cheaper. Also, if you are using the same TTL, data within that partition will expire pretty much at the same time.
I do agree that it is a bit of a pain to have to make multiple queries to read across multiple partitions as it pushes some complexity into the application end. You may also need to maintain a separate table to keep track of the buckets for the given id if they can not be determined implicitly.
As far as performance goes, do you see it as likely that you will need to read across partitions when your application makes queries? For example, if you have a query for 'the most recent 1000 records' and a partition is typically wider than that, you may only need to make one query with Option #1. However, if you want a query like 'give me all records', Option #2 may be better, as otherwise you'll need to make a query for each bucket.
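A sketch of those two query shapes (using the illustrative option1/option2 tables defined in the next answer):

-- Option #1: the bucket must be named, so 'most recent 1000' may need several queries.
SELECT * FROM option1 WHERE id = 1 AND bucket = 2015 ORDER BY timestamp DESC LIMIT 1000;

-- Option #2: one query covers the whole history for the id.
SELECT * FROM option2 WHERE id = 1 ORDER BY timestamp DESC LIMIT 1000;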
After creating the tables you described above:
CREATE TABLE option1 (
    id bigint,
    bucket bigint,
    timestamp timestamp,
    data text,
    PRIMARY KEY ((id, bucket), timestamp)
) WITH default_time_to_live=10;

CREATE TABLE option2 (
    id bigint,
    timestamp timestamp,
    data text,
    PRIMARY KEY (id, timestamp)
) WITH default_time_to_live=10;
I inserted a test row into each:
INSERT INTO option1 (id,bucket,timestamp,data) VALUES (1,2015,'2015-03-16 11:24:00-0500','test1');
INSERT INTO option2 (id,timestamp,data) VALUES (1,'2015-03-16 11:24:00-0500','test2');
...waited 10 seconds, queried with tracing on, and saw identical tombstone counts for each table. So either way that shouldn't be too much of a concern for you.
The real issue is that if you think you'll ever hit the limit of 2 billion columns per partition, then Option #1 is the safe one. If you have a lot of data, Option #1 might also perform better (because you'll be eliminating the need to look at partitions that don't match your bucket), but really either one should be fine in that respect.
tl;dr;
As the issues of performance and tombstones are going to be similar no matter which option you choose, I'm thinking that Option #2 is the better one, just due to ease of querying.
I have inherited a database with tables that lack primary keys. It's an OLTP database. One of the tables in question has ~300k records and has no primary key implemented, even though examining the rest of the schema tells me one column is used as a primary key, i.e. it is replicated in another table with an identical name, etc. In other words, this is not an 'end of line' table.
This database also does not implement FKs.
My question is - is there ANY valid reason for a table (in Oracle for that matter) NOT to have a primary key?
I think a PK is mandatory in almost all cases. Lots of reasons exist, but I'll cover some of them.
prevents inserting duplicate rows
rows will be referenced, so the table needs a key for them
I have seen very few cases where tables are made without a PK (e.g. tables for logs).
Not specific to Oracle, but I recall reading about one such use case where MySQL was highly customized for a dam (electricity generation) project, I think. The input data from the sensors were on the order of 100-1000 records per second or so. They were using timestamps for each record, so they didn't need a primary key (like with the logs/logging mentioned in another answer here).
So good reasons would be:
Overhead, in the case of high frequency transactions
Necessity, or the lack of it, in that case
"Uniqueness" maintained or inferred by application, not by db
In a normalized table, if every record needs to be unique and every field is referenced in other tables, then a PK additionally adds index overhead, and the PK might never actually be used in any SQL query (imho I disagree with this, but it's possible). The table should still have a unique index encompassing all the fields, though.
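For instance, a minimal sketch of enforcing that uniqueness with a unique index rather than a PK (table and column names are hypothetical):

CREATE TABLE sensor_readings (
    sensor_id    NUMBER,
    reading_time TIMESTAMP,
    reading      NUMBER
);

-- Uniqueness is still guaranteed, but there is no PK constraint and no surrogate key.
CREATE UNIQUE INDEX ux_sensor_readings ON sensor_readings (sensor_id, reading_time, reading);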
Bad reasons are infinite :-)
The most frequent bad reason which is actually responsible for the lack of a primary key is when DBs are designed by application/code-developers with little or no DB experience, who want to (or think they should) handle all data constraints in the application.
Any valid reason? I'd say "No"--I'm a database guy--but there are places that insist on using the database as a dumb data store. They usually implement all integrity "constraints" in application code.
Putting integrity constraints into application code isn't usually done to improve performance. In fact, if you built one database that enforces all the known constraints, and you built another with functionally identical constraints only in application code, the first one would almost certainly run rings around the second one.
Instead, application-level constraints usually hope to increase flexibility. (And, in the process, some of the known constraints are usually dropped, which appears to improve performance.) If it becomes inconvenient to enforce certain constraints in order to bulk load some scruffy data, an application programmer can just side-step the application-level constraints for a little while, then clean up the data when it's more convenient.
I'm not a DB expert, but I remember a conversation with a friend who worked in the Oracle apps department who told me that this was done to handle emergencies. If there was a problem in some report being generated which you could fix by putting in a row, DB-level constraints often stand in your way. They generally implemented things like unique primary keys in the application rather than the database. It was inefficient but good enough for them, and much more manageable in a disaster recovery scenario.
You need a primary key to enforce uniqueness for a subset of its columns (useful if you need to refer to individual rows). It also speeds up certain queries because of the index associated to it.
If you do not need that index, or that uniqueness constraint, then you may not need a primary key (the index does not come free).
An example that comes to mind are logging tables, that just record some data (that is never updated or queried for individual records).
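For example, a sketch of such an append-only logging table with no primary key (names are illustrative):

CREATE TABLE app_log (
    logged_at  TIMESTAMP,
    severity   VARCHAR2(10),
    message    VARCHAR2(4000)
);
-- Rows are only ever inserted; there is no PK, no unique constraint, and no index to maintain.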
There is a small overhead when inserting into a table with an index, and you need an index if you have a primary key. The downside of not having one, of course, is that finding a row is very costly.