Frequent, highly fragmented index on heap table - uniqueidentifier

I have a table (a heap table) with a nonclustered index that frequently becomes highly fragmented. The data in column ID is imported from a CSV file, and the ID is thereafter used in relations with other tables for reporting purposes. The table is updated (data is inserted) from a CSV several times a day. I frequently run an index REORGANIZE to reduce the fragmentation.
Do you have any other ideas to help keep fragmentation from occurring so frequently?
The following is a sample script of the table:
CREATE TABLE [dbo].[MyTable](
[ID] [uniqueidentifier] NOT NULL,
[EventID] [uniqueidentifier] NOT NULL,
[AssemblyID] [uniqueidentifier] NOT NULL,
[TimeStamp] [smalldatetime] NOT NULL,
[IsTrue] [bit] NOT NULL,
[IsExempt] [bit] NOT NULL CONSTRAINT [DF_IsExempt] DEFAULT ((0)),
CONSTRAINT [UQ_MyTable_ID] UNIQUE NONCLUSTERED ([ID] ))
GO
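A reorganize against this index would look something like the following; a minimal sketch, using the constraint name from the script above:
ALTER INDEX [UQ_MyTable_ID] ON [dbo].[MyTable] REORGANIZE;
GO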

Another idea would be to lower the fill factor of that particular table.
While the fill factor does not affect the heap itself, the index is affected.
Also see: Intro to fill factor.
SQL Server only uses fillfactor when you’re creating, rebuilding, or reorganizing an index. It does not use fillfactor if it’s allocating a fresh new page at the end of the index.
Let's look at the example of a clustered index where the key is an increasing INT identity value again. We're just inserting rows and it's adding new pages at the end of the index. The index was created with a 70% fillfactor (which maybe wasn't a good idea). As inserts add new pages, those pages are filled as much as possible, likely over 70%. (It depends on the row size and how many rows can fit on the page.)
A lower fill factor gives SQL Server room to insert rows at random positions across your index and will somewhat reduce the number of page splits. You'll have to test what works best for your data insertion patterns.
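As a sketch, assuming you keep the unique constraint from the script above, applying a lower fill factor means rebuilding the index with an explicit FILLFACTOR (the value 80 here is illustrative; test against your own insert volume):
ALTER INDEX [UQ_MyTable_ID] ON [dbo].[MyTable] REBUILD WITH (FILLFACTOR = 80);
GO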

Related

How to choose columns for partitioning and bucketing in a Hive table?

What would be the ideal columns for partitioning and bucketing for the schema below? Is it necessary to implement both, or is one good to go?
user_id INTEGER UNSIGNED,
product_id VARCHAR(20),
gender ENUM('M','F') default NULL,
age VARCHAR(6),
occupation TINYINT UNSIGNED default NULL,
city_category ENUM('A','B','C','D','E') default NULL,
stay_in_current_city_years VARCHAR(6),
martial_status TINYINT UNSIGNED default 0,
product_category_1 TINYINT UNSIGNED default 0,
product_category_2 TINYINT UNSIGNED default 0,
product_category_3 TINYINT UNSIGNED default 0,
purchase_amount INTEGER UNSIGNED default 0
The main goal is to do some analysis based on the above attributes using Hive.
In Hive, you create a table based on its usage pattern, so you should choose both partitioning and bucketing based on what your analysis queries will look like.
However, the following guidelines are advisable.
Partitioning
Partitioning helps you speed up queries with predicates (i.e. WHERE conditions). So in your case, if city_category is the field you are going to use most of the time in your WHERE conditions, you should choose that field for partitioning.
It might degrade the performance of other queries, though.
You also need to make sure the cardinality of the partition column is not too high; otherwise your query performance will be degraded.
To understand the above points, you need to understand how partitioning works. When you create a partition (or subpartition), Hive creates a subfolder with that name and stores the data files in that folder.
So if you partition on city_category, your folders would look like this:
/data/table_name/city_category=A
/data/table_name/city_category=B
...
/data/table_name/city_category=E
This helps Hive find a particular record when you provide city_category in the WHERE condition, as it only has to scan one folder.
However, if you try to find a record based on user_id or product_id, Hive needs to scan all the folders.
And if you end up partitioning on purchase_amount, you will have a huge number of folders. The NameNode has to maintain the location of every folder and file, so this creates a lot of load on the NameNode and will obviously degrade the performance of your queries.
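A minimal HiveQL sketch of such a partitioned table (the table name and column subset are illustrative; note that the partition column is declared in PARTITIONED BY, not in the column list):
CREATE TABLE purchases (
  user_id INT,
  product_id STRING,
  purchase_amount INT
)
PARTITIONED BY (city_category STRING);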
Bucketing
It helps speed up your join queries if the other table you are joining has similar bucketing.
However, it's a good idea to make sure the data is distributed evenly across the buckets.
What bucketing does is apply a hash function to a given field and, based on the hash value, store the record in a particular bucket.
So let's say you bucket on city_category and ask for 50 buckets:
CLUSTERED BY (city_category) INTO 50 BUCKETS
As we have only 5 categories, the other 45 buckets would be empty. This is something you don't want, as it will degrade the performance of your queries.
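A sketch of a more sensible choice, building on the partitioned table above: bucket on a high-cardinality column such as user_id, so the hash spreads rows across all 50 buckets (names and types are illustrative):
CREATE TABLE purchases (
  user_id INT,
  product_id STRING,
  purchase_amount INT
)
PARTITIONED BY (city_category STRING)
CLUSTERED BY (user_id) INTO 50 BUCKETS;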

SQLite: Best practices for using AUTOINCREMENT

According to the official manual:
"The AUTOINCREMENT keyword imposes extra CPU, memory, disk space, and disk I/O overhead and should be avoided if not strictly needed. It is usually not needed."
So is it better not to use it? Do you have any benchmarks of using the implicit rowid against using AUTOINCREMENT?
As recommended in the documentation, it is better not to use AUTOINCREMENT unless you need to ensure that the alias of the rowid (aka the id) is greater than any that has ever been allocated. However, in normal use it is largely a moot point, as even without AUTOINCREMENT a higher rowid/id will result until the maximum rowid of 9223372036854775807 has been reached.
If you do reach an id/rowid of 9223372036854775807, then with AUTOINCREMENT coded that's it: an SQLITE_FULL exception will be raised. Without AUTOINCREMENT, attempts will instead be made to find an unused id/rowid.
AUTOINCREMENT adds a row (and the table itself, if required) to sqlite_sequence recording the highest allocated id. The difference between with and without AUTOINCREMENT is that with it the sqlite_sequence table is consulted, whilst without it it isn't. So if the row with the highest id is deleted, AUTOINCREMENT takes the highest ever allocated id from the sqlite_sequence table (and uses the greater of that or max(rowid)); without AUTOINCREMENT, the highest id currently in the table being inserted into is used (equivalent to max(rowid)).
With limited testing, an overhead of 8-12% was found, as per What are the overheads of using AUTOINCREMENT for SQLite on Android?
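For reference, a minimal sketch of the two declarations being compared (table and column names are illustrative):
-- id is an alias of the rowid; in normal use the next id is max(rowid)+1
CREATE TABLE log_plain (id INTEGER PRIMARY KEY, msg TEXT);
-- AUTOINCREMENT additionally consults sqlite_sequence so ids never repeat
CREATE TABLE log_auto (id INTEGER PRIMARY KEY AUTOINCREMENT, msg TEXT);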
I have tried SQLite3 AUTOINCREMENT with Python 3 and SQLAlchemy 1.4.
Before enabling AUTOINCREMENT on the INTEGER PRIMARY KEY id column, a single insert took less than 0.1 seconds. After enabling it, a single insert took more than 1.5 seconds.
The performance gap is big.

Delete rows vs Delete Columns performance

I'm creating the datamodel for a timeseries application on Cassandra 2.1.3. We will be preserving X amount of data for each user of the system and I'm wondering what is the best approach to design for this requirement.
Option1:
Use a 'bucket' in the partition key, so data for X period goes into the same row. Something like this:
((id, bucket), timestamp) -> data
I can delete a single row at once at the expense of maintaining this bucket concept. It also limits the range I can query on timestamp, probably resulting in several queries.
Option2:
Store all the data in the same row; deletes are then per column.
(id, timestamp) -> data
Range queries are easy again. But what about performance after many column deletes?
Given that we plan to use TTL to let the data expire, which of the two models would deliver the best performance? Is the tombstone overhead of Option1 << Option2 or will there be a tombstone per column on both models anyway?
I'm trying to avoid to bury myself in the tombstone graveyard.
I think it will all depend on how much data you plan on having for the given partition key you end up choosing, what your TTL is and what queries you are making.
I typically lean towards Option #1, especially if your TTL is the same for all writes. In addition, if you are using LeveledCompactionStrategy or DateTieredCompactionStrategy, Cassandra will do a great job keeping data from the same partition in the same SSTable, which will greatly improve read performance.
If you use Option #2, data for the same partition could likely be spread across multiple levels (if using LCS) or just in general multiple SSTables, which may cause you to read from a lot of SSTables, depending on the nature of your queries. There is also the issue of hotspotting, where you could overload particular Cassandra nodes if you have a really wide partition.
The other benefit of #1 (which you allude to), is that you can easily delete the entire partition, which creates a single tombstone marker which is much cheaper. Also, if you are using the same TTL, data within that partition will expire pretty much at the same time.
I do agree that it is a bit of a pain to have to make multiple queries to read across multiple partitions, as it pushes some complexity into the application end. You may also need to maintain a separate table to keep track of the buckets for a given id if they cannot be determined implicitly.
As far as performance goes, do you see it as likely that you will need to read across partitions when your application makes queries? For example, if you have a query for 'the most recent 1000 records' and a partition is typically wider than that, you may only need to make one query for Option #1. However, if you want a query like 'give me all records', Option #2 may be better, as otherwise you'll need to make a query for each bucket.
After creating the tables you described above:
CREATE TABLE option1 (
    id bigint,
    bucket bigint,
    timestamp timestamp,
    data text,
    PRIMARY KEY ((id, bucket), timestamp)
) WITH default_time_to_live=10;

CREATE TABLE option2 (
    id bigint,
    timestamp timestamp,
    data text,
    PRIMARY KEY (id, timestamp)
) WITH default_time_to_live=10;
I inserted a test row into each:
INSERT INTO option1 (id,bucket,timestamp,data) VALUES (1,2015,'2015-03-16 11:24:00-0500','test1');
INSERT INTO option2 (id,timestamp,data) VALUES (1,'2015-03-16 11:24:00-0500','test2');
...waited 10 seconds, queried with tracing on, and saw identical tombstone counts for each table. So either way, tombstones shouldn't be too much of a concern for you.
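A sketch of that check in cqlsh (TRACING ON makes the tombstone counts visible in the trace output):
TRACING ON;
SELECT * FROM option1 WHERE id=1 AND bucket=2015;
SELECT * FROM option2 WHERE id=1;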
The real issue, is that if you think you'll ever hit the limit of 2 billion columns per partition, then Option #1 is the safe one. If you have a lot of data Option #1 might perform better (because you'll be eliminating the need to look at partitions that don't match your bucket), but really either one should be fine in that respect.
tl;dr;
As the issues of performance and tombstones are going to be similar no matter which option you choose, I'm thinking that Option #2 is the better one, just due to ease of querying.

For insert-performance consideration, should a clustered index on a timestamp be ascending or descending?

I just realized I have a clustered index on a timestamp in descending order. I'm thinking about switching it to ascending, so that as new, ever-increasing timestamps are inserted, they are added to the end of the table. As it stands now, I suspect it has to add rows at the beginning of the table, and I wonder how SQL Server handles that.
Can it efficiently allocate new pages at the beginning of the table and efficiently insert new rows into those pages? Or would it be better to fill pages in timestamp order, allocating new pages at the end, with an ascending clustered index?
It's actually the same whether you add at the start or the end.
A page fills up, the page splits, a new page is allocated...
The new page may or may not be contiguous, whether it's at the start or the end, which is why you run ALTER INDEX etc. regularly.
The ASC/DESC order of the clustered index will matter more for SELECT/ORDER BY in practice, although I've noticed this less in SQL Server 2005 and above.
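As a sketch, the two variants under discussion would be declared like this (table and index names are hypothetical; switching direction means recreating the index, e.g. with DROP_EXISTING = ON):
CREATE CLUSTERED INDEX IX_Events_TimeStamp ON dbo.Events ([TimeStamp] ASC);
-- versus
CREATE CLUSTERED INDEX IX_Events_TimeStamp ON dbo.Events ([TimeStamp] DESC);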

Primary Key Effect on Performance in SQLite

I have an SQLite database used to store information about backup jobs. Each run, it grows by approximately 25 MB as a result of adding around 32,000 entries to a particular table.
This table is a "map table" used to link certain info to records in another table, and it has a primary key (an autoincrement INT) that I don't use.
SQLite will reserve 1, 2, 4, or 8 bytes for an INT column depending on its value. This table has only 3 additional columns, also of INT type.
I've added indexes to the database on the columns that I use as filters (WHERE) in my queries.
In the presence of indexes, etc. and in the situation described, do primary keys have any useful benefit in terms of performance?
Note: Performance is very, very important to this project - but not if 10 ms saved on a 32,000-entry job means an additional 10 MB of data!
A primary key index is used to look up a row for a given primary key. It is also used to ensure that the primary key values are unique.
If you search your data using other columns, the primary key index will not be used, and as such will yield no performance benefit. Its mere existence should not have a negative performance impact either, though.
An unnecessary index wastes disk space, and makes INSERT and UPDATE statements execute slower. It should have no negative impact on query performance.
If you really don't use this id, why don't you drop the column and its primary key? The only reason to keep an unused primary key id column alive is to make it possible to create a master-detail relation with another table.
Another possibility is to keep the column but drop the primary key. That means the application has to take care of providing a unique id with every insert statement, and before and after each batch operation you have to check whether the column is still unique. This doesn't work in, for instance, MySQL and Oracle because of concurrency issues, but it does work in SQLite.
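Since SQLite's ALTER TABLE cannot drop a primary key in place (and older versions cannot drop columns at all), removing the unused id means rebuilding the table; a sketch, with hypothetical column names a, b, c:
CREATE TABLE map_new (a INTEGER, b INTEGER, c INTEGER);
INSERT INTO map_new SELECT a, b, c FROM map;
DROP TABLE map;
ALTER TABLE map_new RENAME TO map;
-- recreate the filter indexes afterwards, e.g.:
CREATE INDEX idx_map_a ON map(a);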
