Is there record number limit in LevelDB? - leveldb

Is there maximum key number limit in LevelDB or key number limit for productivity (like in Kyoto Cabinet: the number of records determines number of buckets that must be calculated before database is created; if number of records override that limit db loses productivity, but keep working)?

No limit that I would know about. However if database gets very large then most of the data is stored in the last level (about 89%) and merging with it may become expensive (a lot of files in the last level will overlap with to be merged data in the previous level).
Another thing that say at 40Gb you'll have 20480 2Mb files in one folder and it could be that your file system's performance will degrade with so many files.
To know for sure you need to experiment.

Related

Using ChronicleMap as a key-value database

I would like to use a ChronicleMap as a memory-mapped key-value database (String to byte[]). It should be able to hold up to the order of 100 million entries. Reads/gets will happen much more frequently than writes/puts, with an expected write rate of less than 10 entries/sec. While the keys would be similar in length, the length of the value could vary strongly: it could be anything from a few bytes up to tens of Mbs. Yet, the majority of values will have a length between 500 to 1000 bytes.
Having read a bit about ChronicleMap, I am amazed about its features and am wondering why I can't find articles describing it being used as a general key-value database. To me there seem to be a lot of advantages of using ChronicleMap for such a purpose. What am I missing here?
What are the drawbacks of using ChronicleMap for the given boundary conditions?
I voted for closing this question because any "drawbacks" would be relative.
As a data structure, Chronicle Map is not sorted, so it doesn't fit when you need to iterate the key-value pairs in the sorted order by key.
Limitation of the current implementation is that you need to specify the number of elements that are going to be stored in the map in advance, and if the actual number isn't close to the specified number, you are going to overuse memory and disk (not very severely though, on Linux systems), but if the actual number of entries exceeds the specified number by approximately 20% or more, operation performance starts to degrade, and the performance hit grows linearly with the number of entries growing further. See https://github.com/OpenHFT/Chronicle-Map/issues/105

Collecting large statistical sets with pg_stat_statements?

According to Postgres pg_stat_statements documentation:
The module requires additional shared memory proportional to
pg_stat_statements.max. Note that this memory is consumed whenever the
module is loaded, even if pg_stat_statements.track is set to none.
and also:
The representative query texts are kept in an external disk file, and
do not consume shared memory. Therefore, even very lengthy query texts
can be stored successfully. However, if many long query texts are
accumulated, the external file might grow unmanageably large.
From these it is unclear what the actual memory cost of a high pg_stat_statements.max would be - say at 100k or 500k (default is 5k). Is it safe to set the levels that high, would could be the negative repercussions of such high levels? Would aggregating statistics into an external database via logstash/fluentd be a preferred approach above certain sizes?
1.
from what I have read, it hashes the query and keeps it in DB, saving the text to FS. So next concern is more expected then overloaded shared memory:
if many long query texts are accumulated, the external file might grow
unmanageably large
the hash of text is so much smaller then text, that I think you should not worry about extension memory consumption comparing long queries. Especially knowing that extension uses Query Analyser (which will work for EVERY query ANYWAY):
the queryid hash value is computed on the post-parse-analysis
representation of the queries
Setting pg_stat_statements.max 10 times bigger should take 10 times more shared memory I believe. The grows should be linear. It does not say so in documentation, but logically should be so.
There is no answer if it is safe or not to set setting to distinct value, because there is no data on other configuration values and HW you have. But as growth should be linear, consider this answer: "if you set it to 5K, and query runtime has grown almost nothing, then setting it to 50K will prolong it almost nothing times ten". BTW, my question - who is gong to dig 50000 slow statements? :)
2.
This extension already makes a pre-aggregation for "dis-valued" statement. You can select it straight on DB, so moving data to other db and selecting it there will only give you the benefit of unloading the original DB and loading another. In other words you save 50MB for a query on original, but spend same on another. Does it make sense? For me - yes. This is what I do myself. But I also save execution plans for statement (which is not a part of pg_stat_statements extension). I believe it depends on what you have and what you have. Definitely there is no need for that just because of a number of queries. Again unless you have so big file that extension can
As a recovery method if that happens, pg_stat_statements may choose to
discard the query texts, whereupon all existing entries in the
pg_stat_statements view will show null query fields

Sequence cache and performance

I could see the DBA team advises to set the sequence cache to a higher value at the time of performance optimization. To increase the value from 20 to 1000 or 5000.The oracle docs, says the the cache value,
Specify how many values of the sequence the database preallocates and keeps in memory for faster access.
Somewhere in the AWR report I can see,
select SEQ_MY_SEQU_EMP_ID.nextval from dual
Can any performance improvement be seen if I increase the cache value of SEQ_MY_SEQU_EMP_ID.
My question is:
Is the sequence cache perform any significant role in performance? If so how to know what is the sufficient cache value required for a sequence.
We can get the sequence values from oracle cache before them used out. When all of them were used, oracle will allocate a new batch of values and update oracle data dictionary.
If you have 100000 records need to insert and set the cache size is 20, oracle will update data dictionary 5000 times, but only 20 times if you set 5000 as cache size.
More information maybe help you: http://support.esri.com/en/knowledgebase/techarticles/detail/20498
If you omit both CACHE and NOCACHE, then the database caches 20 sequence numbers by default. Oracle recommends using the CACHE setting to enhance performance if you are using sequences in an Oracle Real Application Clusters environment.
Using the CACHE and NOORDER options together results in the best performance for a sequence. CACHE option is used without the ORDER option, each instance caches a separate range of numbers and sequence numbers may be assigned out of order by the different instances. So more the value of CACHE less writes into dictionary but more sequence numbers might be lost. But there is no point in worrying about losing the numbers, since rollback, shutdown will definitely "lose" a number.
CACHE option causes each instance to cache its own range of numbers, thus reducing I/O to the Oracle Data Dictionary, and the NOORDER option eliminates message traffic over the interconnect to coordinate the sequential allocation of numbers across all instances of the database. NOCACHE will be SLOW...
Read this
By default in ORACLE cache in sequence contains 20 values. We can redefine it by given cache clause in sequence definition. Giving cache caluse in sequence benefitted into that when we want generate big integers then it takes lesser time than normal, otherwise there are no such drastic performance increment by declaring cache clause in sequence definition.
Have done some research and found some relevant information in this regard:
We need to check the database for sequences which are high-usage but defined with the default cache size of 20 - the performance
benefits of altering the cache size of such a sequence can be
noticeable.
Increasing the cache size of a sequence does not waste space, the
cache is still defined by just two numbers, the last used and the
high water mark; it is just that the high water mark is jumped by a
much larger value every time it is reached.
A cached sequence will return values exactly the same as a non-cached
one. However, a sequence cache is kept in the shared pool just as
other cached information is. This means it can age out of the shared
pool in the same way as a procedure if it is not accessed frequently
enough. Everything is the cache is also lost when the instance is
shut down.
Besides spending more time updating oracle data dictionary having small sequence caches can have other negative effects if you work with a Clustered Oracle installation.
In Oracle 10g RAC Grid, Services and Clustering 1st Edition by Murali Vallath it is stated that if you happen to have
an Oracle Cluster (RAC)
a non-partitioned index on a column populated with an increasing sequence value
concurrent multi instance inserts
you can incur in high contention on the rightmost index block and experience a lot of Cluster Waits (up to 90% of total insert time).
If you increase the size of the relevant sequence cache you can reduce the impact of Cluster Waits on your index.

Optimizing massive insert performance...?

Given: SQL Server 2008 R2. Quit some speedin data discs. Log discs lagging.
Required: LOTS LOTS LOTS of inserts. Like 10.000 to 30.000 rows into a simple table with two indices per second. Inserts have an intrinsic order and will not repeat, as such order of inserts must not be maintained in short term (i.e. multiple parallel inserts are ok).
So far: accumulating data into a queue. Regularly (async threadpool) emptying up to 1024 entries into a work item that gets queued. Threadpool (custom class) has 32 possible threads. Opens 32 connections.
Problem: performance is off by a factor of 300.... only about 100 to 150 rows are inserted per second. Log wait time is up to 40% - 45% of processing time (ms per second) in sql server. Server cpu load is low (4% to 5% or so).
Not usable: bulk insert. The data must be written as real time as possible to the disc. THis is pretty much an archivl process of data running through the system, but there are queries which need access to the data regularly. I could try dumping them to disc and using bulk upload 1-2 times per second.... will give this a try.
Anyone a smart idea? My next step is moving the log to a fast disc set (128gb modern ssd) and to see what happens then. The significant performance boost probably will do things quite different. But even then.... the question is whether / what is feasible.
So, please fire on the smart ideas.
Ok, anywering myself. Going to give SqlBulkCopy a try, batching up to 65536 entries and flushing them out every second in an async fashion. Will report on the gains.
I'm going through the exact same issue here, so I'll go through the steps i'm taking to improve my performance.
Separate the log and the dbf file onto different spindle sets
Use basic recovery
you didn't mention any indexing requirements other than the fact that the order of inserts isn't important - in this case clustered indexes on anything other than an identity column shouldn't be used.
start your scaling of concurrency again from 1 and stop when your performance flattens out; anything over this will likely hurt performance.
rather than dropping to disk to bcp, and as you are using SQL Server 2008, consider inserting multiple rows at a time; this statement inserts three rows in a single sql call
INSERT INTO table VALUES ( 1,2,3 ), ( 4,5,6 ), ( 7,8,9 )
I was topping out at ~500 distinct inserts per second from a single thread. After ruling out the network and CPU (0 on both client and server), I assumed that disk io on the server was to blame, however inserting in batches of three got me 1500 inserts per second which rules out disk io.
It's clear that the MS client library has an upper limit baked into it (and a dive into reflector shows some hairy async completion code).
Batching in this way, waiting for x events to be received before calling insert, has me now inserting at ~2700 inserts per second from a single thread which appears to be the upper limit for my configuration.
Note: if you don't have a constant stream of events arriving at all times, you might consider adding a timer that flushes your inserts after a certain period (so that you see the last event of the day!)
Some suggestions for increasing insert performance:
Increase ADO.NET BatchSize
Choose the target table's clustered index wisely, so that inserts won't lead to clustered index node splits (e.g. autoinc column)
Insert into a temporary heap table first, then issue one big "insert-by-select" statement to push all that staging table data into the actual target table
Apply SqlBulkCopy
Choose "Bulk Logged" recovery model instad of "Full" recovery model
Place a table lock before inserting (if your business scenario allows for it)
Taken from Tips For Lightning-Fast Insert Performance On SqlServer

Performance with sequentially increasing primary key

Looking for guidance on selecting a database provider for a specific key pattern.
The only key field will be a pre-allocated unique sequentially-increasing number.
During each day between
50 and 100 thousand items will be added,
processed (updated), and then retained for a week or so,
after which usually the lowest-numbered records will be deleted. The number of
records will not fluctuate by very much from day to day but may drop at weekends.
The numbers will probably wrap back to 1 after 100M or so.
I need to find a database implementation where the efficiency of the index lookup,
addition and deletion remains constant. Should I be worried that the performance might drop off as the key value range moves continuously upwards?
index lookup, addition and deletion remains constant
You could ensure it remains constant by rebuilding the indexes every insert (just constantly really slow - no performance drop off at all :)), or close to constant by running index maintenance every hour/day etc.
that the performance might drop off as the key value range moves continuously upwards?
As long as you've got an index, it should be logN performance - e.g. having 1,000,000 rows will be around half the speed of 1,000 rows (when searching for an indexed value). (1,000,000,000,000 will be half that speed again).
So no, you shouldn't need to worry about performance.
The numbers will probably wrap back to 1 after 100M or so.
Ok - if you want. Generally, no need really - just use a big int.
As always with performance: test what you want to do. Make a script that inserts 10,000,000 rows, and see what happens.
My point here being that if you're going to wrap ids at 100M records, the worst you can do is actually have them all allocated. This would represent the fragmented index condition as well (where you only have say 100K records, but they're distributed in a space of 10M) - but you will do index/database maintenance right?

Resources