Smart chunking from a huge table - algorithm

I have a huge table in a data warehouse (Vertica). I am accessing this table in chunks for optimization purposes. The way I am deciding my chunks is pretty straightforward: I have a primary key column, say A, and I take MAX(A). I have a chunk size of 20,000, so I end up with (MAX(A)/20000)+1 chunks. I frame a query for each chunk and retrieve the data.
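For concreteness, the current approach looks roughly like this (the table name below is a placeholder; A is the primary key column; k stands for the chunk number substituted in by the client code):
SELECT MAX(A) FROM my_table;   -- number of chunks = MAX(A)/20000 + 1

-- query for chunk k, for k = 0, 1, 2, ...
SELECT *
FROM my_table
WHERE A >= k * 20000
  AND A < (k + 1) * 20000;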
The problems with this approach are as follows:
The number of chunks depends on MAX(A), and since MAX(A) is growing very fast, the number of chunks grows with it.
I settled on 20,000 because that is what gives me optimal performance, but the distribution of primary keys within those 20,000-wide ranges is very scattered. For example, the range 0-20000 might contain only 3 keys and the range 20000-40000 might contain 500 keys, and no range comes close to holding 20,000.
I am trying to figure out whether there is a good approximation algorithm for this problem that minimizes the number of chunks while filling each chunk with close to 20,000 primary keys.
Any pointers towards a solution are appreciated.

I'm not sure what "optimization purposes" means here, but I think the best approach would be to create a timestamp column, or use an eligible existing timestamp column, to partition on. You could then partition on a larger frame of reference so there isn't a wide range between the partitions.
If the table is partitioned, it will be able to benefit from partition pruning. This means that during query execution Vertica can eliminate the storage containers that do not match the timestamp predicate.
Otherwise, you can look at the segmentation clause and use the max/min from the storage containers. This could be slightly more complicated.
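A minimal sketch of the partitioning idea, assuming the table is called my_table and has (or is given) a timestamp column created_at; the monthly partition expression is just one common choice and all names are placeholders:
ALTER TABLE my_table
  PARTITION BY EXTRACT(YEAR FROM created_at) * 100 + EXTRACT(MONTH FROM created_at)
  REORGANIZE;

-- A query that filters on created_at can then be answered from only the
-- matching partitions (partition pruning):
SELECT *
FROM my_table
WHERE created_at >= '2015-06-01' AND created_at < '2015-07-01';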

Related

Hive partition scenario and how it impacts performance

I want to ask about the number of Hive partitions and how it impacts performance.
Let me illustrate this with a real example:
I have an external table that is expected to receive around 500M rows per day from multiple sources, and it has 5 partition columns.
For one day, that resulted in 250 partitions; with 1 year of retention it will reach around 75K partitions, which I suppose is a huge number. When I checked, Hive can go up to 10K partitions, but beyond that performance is going to be bad (and someone told me that partitions should not exceed 1K per table).
The queries that will select from this table break down roughly as follows:
50% of them will use the exact order of partitions;
25% will use only 1-3 partitions and not the other 2;
25% will use only the 1st partition.
So do you think this may work well even with 1 month of retention? Or would only the start date be enough, assuming a normal distribution over the other 4 columns (say 500M rows / 250 partitions, giving 2M rows per partition)?
I would go with 3 partition columns, since that will a) exactly match ~50% of your query profiles, and b) substantially reduce (prune) the number of scanned partitions for the other 50%. At the same time, you won't be pressured to increase your Hive MetaStore (HMS) heap memory and beef up HMS backend database to work efficiently with 250 x 364 = 91,000 partitions.
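A hypothetical sketch of that layout (the table and column names below are invented for illustration, not taken from the question):
CREATE EXTERNAL TABLE events (
    event_id BIGINT,
    payload  STRING
)
PARTITIONED BY (event_date STRING, source STRING, region STRING)
STORED AS ORC
LOCATION '/warehouse/events';

-- Filtering on the leading partition columns lets Hive prune everything
-- outside the matching partitions:
SELECT COUNT(*)
FROM events
WHERE event_date = '2021-06-01' AND source = 'mobile';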
Since the 10K limit was introduced, significant efforts have been made to improve partition-related operations in HMS. See, for example, JIRA HIVE-13884, which provides the motivation for keeping that number low and describes how high partition counts are being addressed:
The PartitionPruner requests either all partitions or partitions based
on filter expression. In either scenarios, if the number of partitions
accessed is large there can be significant memory pressure at the HMS
server end.
... PartitionPruner [can] first fetch the partition names (instead of
partition specs) and throw an exception if number of partitions
exceeds the configured value. Otherwise, fetch the partition specs.
Note that partition specs (mentioned above) and the statistics gathered per partition (always recommended for efficient querying) are what constitute the bulk of the data HMS has to store and cache for good performance.

Partition index to reduce buffer busy waits?

From time to time our Oracle response times degrade significantly for a minute or two, without any extra load.
We were able to identify an INSERT statement which produces a lot of buffer busy waits.
From the ADDM report, we got the following hint:
Consider partitioning the INDEX "IDX1" with object
ID 4711 in a manner that will evenly distribute concurrent DML across
multiple partitions.
To be honest: I am not sure what that means. I don't know what a partitioned index is. I can only imagine that it means creating a partition with a local index.
Can you help me out here?
There is a very high frequency of reading and writing to that table. No updates or deletes are used.
Thanks,
E.
I am not sure what that means.
Oracle is telling you that there is a lot of concurrent ("at the same time") activity on a very small part of your index. This happens a lot.
Consider an index column TAB1_PK on table TAB1 whose values are inserted from a sequence TAB1_S. Suppose you have 5 database sessions all inserting into TAB1 at the same time.
Because TAB1_PK is indexed, and because the sequence is generating values in numeric order, what happens is that all those sessions have to read and update the same blocks of the index at the same time.
This can cause a lot of contention -- way more than you would expect, due to the way indexes work with multi-version read consistency. I mean, in some rare situations (depending on how the transaction logic is written and the number of concurrent sessions), it can really be crippling.
The (really) old way to avoid this problem was to use a reverse key index. That way, the sequential column values did not all go to the same index blocks.
However, that is a two-edged sword. On the one hand, you get less contention because you're inserting all over the index (good). On the other hand, your rows are going all over the index, meaning you cannot cache them all. You've just turned a big logical I/O problem into a physical I/O problem!
Nowadays, we have a better solution -- a GLOBAL HASH PARTITION on the index.
With a GHP, you can specify the number of hash buckets and use that to trade off between how much contention you need to handle and how compact you want the index updates to be (for better buffer caching). The more index hash partitions you use, the better your concurrency but the worse your index block buffer caching will be.
I find a number (of global hash partitions) around 16 is pretty good.
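A sketch of both options in Oracle DDL, reusing the example names TAB1 and TAB1_PK from above (in practice you would create only one of these indexes):
-- The (really) old approach: a reverse key index spreads sequential values
-- across index blocks, at the cost of scattering them over the whole index.
CREATE INDEX idx1 ON tab1 (tab1_pk) REVERSE;

-- The newer approach: a global hash-partitioned index with 16 hash partitions,
-- trading a little buffer caching for much less block contention.
CREATE INDEX idx1 ON tab1 (tab1_pk)
  GLOBAL PARTITION BY HASH (tab1_pk)
  PARTITIONS 16;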

Oracle performance database size increases

I have a high-level question. Say I have a SQL query that takes 30ms to complete; it runs against an indexed column on a table with 1 million records. Now if the table size is increased to 5 million records, should I expect the query to take 5 times as long (as 5 times as many index entries have to be searched), i.e. 150ms? I apologise if this is too simplistic. I have a program that runs 10 (indexed) SQL queries against a table that is going to be increased by this factor; the queries currently take 300ms and I am concerned this would increase to 1.5s. Any help would be appreciated!
You can think of an index lookup as a search through a binary tree followed by a fetch of the page with the appropriate data. Typically, the index fits in memory and the search through it is quite fast. Multiplying the data size by 10 would increase the depth of the tree by 3 or 4 levels. With in-memory comparison operations this would not be noticeable for most queries. (There are other types of indexes besides binary trees, but this is a convenient model for thinking about performance.)
The data fetch then could incur the overhead of reading a page from disk. That should still be quite fast.
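For the question's concrete numbers, this model suggests only a small change in depth (rough arithmetic for the tree model above, not an exact cost formula):
log2(1,000,000) ≈ 20 comparisons per lookup
log2(5,000,000) ≈ 22 comparisons per lookup
So growing from 1 million to 5 million rows adds roughly 2-3 comparisons per lookup, nothing like a 5x factor.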
So, the easy answer to your question is: no. However, this assumes that the query is something like:
select t.*
from table t
where t.indexcol = CONSTANTVALUE
And, it assumes that the query only returns one row. Things that might affect the performance as the table size increases include:
The size of the returned data set increases with the size of the table. Returning more values necessarily takes longer. For some queries, the performance is more dependent on the mechanism for returning values than calculating/fetching the data.
The query contains a join or a group by.
The statistics of the table are not up-to-date, so the optimizer accidentally chooses a full table scan rather than an index lookup.
You are in a memory-constrained environment where the index doesn't fit in memory. Or, the entire table is in memory when smaller but incurs the overhead of a cache-miss as it gets larger.

Time for updating a sqlite3 index fluctuates too much

I have a large-ish sqlite3 (3.6.22) database (about 1 GB, 5 million rows) with a single table indexed on one column. The problem is that the time to do a typical INSERT transaction fluctuates widely. I insert about 10000 rows at a time (wrapped in a transaction, of course). Often it takes about 1.5 seconds, but roughly every fifth transaction suddenly takes several minutes for the very same kind of transaction to complete. I've done a lot of experimentation, and I've discovered that the phenomenon only occurs if there is an index, which makes me think it is updating the index that takes a lot of time.
I need more consistent performance. A somewhat higher average insertion time would be OK, if I could only avoid some transactions suddenly taking 200x as long as the previous one... What should I do?
Here's the schema. The strings in blocks.md5 are always exactly 32 bytes long and likely unique. The rolling.value column will contain very large 64-bit integers.
CREATE TABLE blocks (blob char(32) NOT NULL,
offset long NOT NULL,
md5 char(32) NOT NULL,
row_md5 char(32));
CREATE TABLE rolling (value INT NOT NULL);
CREATE INDEX index_md5 ON blocks (md5);
CREATE UNIQUE INDEX index_rolling ON rolling (value);
I don't know exactly how sqlite indexes are implemented, but I'd expect the behavior you describe if they were storing the index on disk or reordering the data.
Imagine a scenario where when they are allocating blocks for the index, they start some page with N slots for data. When the page fills up, they have to allocate another and split the data between them.
When you're inserting your data, the ordering of the MD5 will be as random as it gets, so every page will fill up independently. There isn't any reasonable way for the indexing strategy to know that.
Other databases will even recommend using different indexing strategies than normal for strings, especially in the case of something like random MD5s.
Trying to do this in an in-memory database would tell you whether it's algorithmic or disk access.
I've only really tried to avoid this in an offline system where I could sort the data before inserting. After it was all inserted I would build the index, and that was the fastest approach I could find. If you're loading 10k at a time, that might fit your use case, though I don't know.
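If loading first and indexing afterwards is an option for you, the pattern looks roughly like this (a sketch, assuming the blocks schema above and that dropping index_md5 during the load is acceptable):
DROP INDEX IF EXISTS index_md5;

BEGIN TRANSACTION;
-- the 10,000-row batches of INSERTs into blocks go here, e.g.:
INSERT INTO blocks (blob, offset, md5, row_md5)
    VALUES ('0123456789abcdef0123456789abcdef', 0,
            'fedcba9876543210fedcba9876543210', NULL);
COMMIT;

-- Rebuild the index once, after all the data is in place:
CREATE INDEX index_md5 ON blocks (md5);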

Performance with sequentially increasing primary key

Looking for guidance on selecting a database provider for a specific key pattern.
The only key field will be a pre-allocated unique sequentially-increasing number.
During each day between 50 and 100 thousand items will be added, processed (updated), and then retained for a week or so, after which usually the lowest-numbered records will be deleted. The number of records will not fluctuate by very much from day to day but may drop at weekends. The numbers will probably wrap back to 1 after 100M or so.
I need to find a database implementation where the efficiency of the index lookup, addition and deletion remains constant. Should I be worried that the performance might drop off as the key value range moves continuously upwards?
index lookup, addition and deletion remains constant
You could ensure it remains truly constant by rebuilding the indexes on every insert (constant, just constantly really slow - no performance drop-off at all :)), or keep it close to constant by running index maintenance every hour/day, etc.
that the performance might drop off as the key value range moves continuously upwards?
As long as you've got an index, lookups should be log N performance - e.g. searching 1,000,000 rows for an indexed value will be around half the speed of searching 1,000 rows, because the index is roughly twice as deep, and 1,000,000,000,000 rows will be half that speed again.
So no, you shouldn't need to worry about performance.
The numbers will probably wrap back to 1 after 100M or so.
Ok - if you want. Generally, no need really - just use a big int.
As always with performance: test what you want to do. Make a script that inserts 10,000,000 rows, and see what happens.
My point here is that if you're going to wrap IDs at 100M records, the worst you can do is actually have them all allocated. That would also represent the fragmented-index condition (where you only have, say, 100K records, but they're distributed over a space of 10M) - but you will do index/database maintenance, right?
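A test along those lines might look roughly like this (a sketch only: generate_series is PostgreSQL syntax and the table and column names are invented, so adapt it to whichever database you are evaluating):
CREATE TABLE items (id BIGINT PRIMARY KEY, payload VARCHAR(100));

INSERT INTO items (id, payload)
SELECT n, 'x'
FROM generate_series(1, 10000000) AS s(n);

-- Time a point lookup near the top of the key range:
SELECT * FROM items WHERE id = 9999999;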
