CosmosDB throughput limit of single partition? - sharding

In the CosmosDB documentation, Microsoft hints at a throughput limit on a single partition, but does not specify the limit. What is the limit? Here is the relevant documentation: https://learn.microsoft.com/en-us/azure/cosmos-db/partition-data
And the relevant quote:
Behind the scenes, Azure Cosmos DB provisions partitions needed to serve T requests/s. If T is higher than the maximum throughput per partition t, then Azure Cosmos DB provisions N = T/t partitions.

This doesn't explicitly answer your question, but the reason this value is not explicitly documented is that it changes (increases) as the Azure Cosmos DB team changes hardware or rolls out hardware upgrades. The intent is to show that there is always a limit per partition (machine), and that partition keys will be distributed across these partitions.
You can discover the current value by saturating writes against a single partition key at maximum throughput.
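As a rough worked example of the N = T/t relationship (the value of t is not published and has increased over time; 10,000 RU/s per physical partition is the figure documented at the time of writing, so treat it as an assumption):

```
T = 30,000 RU/s      provisioned throughput on the container
t = 10,000 RU/s      assumed per-physical-partition maximum
N = T / t = 3        physical partitions provisioned behind the scenes
```

Note that a single partition key always lives in one physical partition, so no single key can ever consume more than t, no matter how large T is.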

Related

Load 600+ million records in Synapse Dedicated Pool with Oracle as Source

I am trying to do a full load of a very large table (600+ million records) which resides in an on-premises Oracle database. My destination is Azure Synapse Dedicated Pool.
I have already tried the following:
Using ADF Copy activity with source partitioning, as the source table has 22 partitions
I increased the Copy Parallelism and DIU to a very high level
Still, I am able to fetch only 150 million records in 3 hours, whereas the requirement is to complete the full load in around 2 hours, since the source would be frozen to users during that time frame so that Synapse can copy the data.
How can a full copy of the data be done from Oracle to Synapse in that time frame?
For a change, I tried loading data from Oracle to ADLS Gen 2, but it's slow as well
There are a number of factors to consider here. Some ideas:
how fast can the table be read? What indexing / materialized views are in place? Is there any contention at the database level to rule out?
Recommendation: ensure database is set up for fast read on the table you are exporting
as you are on-premises, what is the local network card setup and throughput?
Recommendation: ensure local network setup is as fast as possible
as you are on-premises, you must be using a Self-hosted Integration Runtime (SHIR). What is the spec of this machine? eg 8GB RAM, SSD for spooling etc as per the minimum specification. Where is this located? eg 'near' the datasource (in the same on-premises network) or in the cloud. It is possible to scale out SHIRs by having up to four nodes but you should ensure via the metrics available to you that this is a bottleneck before scaling out.
Recommendation: consider locating the SHIR 'close' to the datasource (ie in the same network)
is the SHIR software version up-to-date? This gets updated occasionally so it's good practice to keep it updated.
Recommendation: keep the SHIR software up-to-date
do you have ExpressRoute or are you going across the internet? ExpressRoute would probably be faster
Recommendation: consider ExpressRoute. Alternatively, consider Data Box for a large one-off export.
you should almost certainly land directly in ADLS Gen 2 or blob storage. Going straight into the database could result in contention there, and you are dealing with Synapse concepts such as transaction logging, DWU, resource classes and queuing contention among others. View the metrics for the storage in the Azure portal to determine whether it is under stress. If it is under stress (which I think unlikely), consider multiple storage accounts
Recommendation: load data to ADLS Gen 2. Although this might seem like an extra step, it provides a recovery point and avoids the contention issues that come from attempting to do the extract and load all at the same time. I would only load directly to the database if you can prove it goes faster and you definitely don't need the recovery point
what format are you landing in the lake? Converting to parquet is quite compute intensive for example. Landing to the lake does leave an audit trail and give you a position to recover from if things go wrong
Recommendation: use parquet for a compressed format. You may need to optimise the file size.
ultimately the best thing to do would be one big bulk load (say taking the weekend) and then do incremental upserts using a CDC mechanism. This would allow you to meet your 2 hour window.
Recommendation: consider a one-off big bulk load and CDC / incremental loads to stay within the timeline
In summary, it's probably your network but you have a lot of investigation to do first, and then a number of options I've listed above to work through.
wBob provided a good summary of things you could look at to increase your transfer speed. In addition to that, you could try to bulk export your data into chunks of data files and transfer the files in parallel to Azure Data Lake or Azure Blob storage; this way you can maximize your network throughput.
Once the data is in the data lake, you can scale up your Synapse instance and take advantage of fast loads using the COPY command (sketched below).
I faced the same problem in our organization, and the fastest way to get the data out of SQL Server was using bcp into a fast storage layer.
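To illustrate the last step, here is a minimal, hypothetical sketch of loading Parquet files staged in ADLS Gen 2 into a dedicated SQL pool table with the COPY command (storage account, container, path, table name and credential type are placeholders to adapt):

```sql
-- Load Parquet files staged in ADLS Gen 2 into a dedicated SQL pool table.
-- Account, container, path and table are illustrative placeholders.
COPY INTO dbo.MyHugeTable
FROM 'https://mystorageaccount.dfs.core.windows.net/landing/myhugetable/*.parquet'
WITH (
    FILE_TYPE  = 'PARQUET',
    CREDENTIAL = (IDENTITY = 'Managed Identity')
);
```

Scaling the pool up for the duration of the load and back down afterwards is the usual way to buy extra load throughput for a one-off bulk copy.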

Does Oracle NoSQL Cloud Service have a provision to set max read units consumption per second?

Does Oracle NoSQL Cloud Service have a provision to set maximum read unit consumption per second? For example, out of 40K read units, I want to reserve 20K for the 1st operation and the remaining 20K for the 2nd operation. In order to make sure 20K is always reserved for the 1st operation, I want to set a maximum read unit consumption per second for the 2nd operation. Is this something that is possible to do?
The provisioned values are for the entire table, so if a table has 40K read units/second, it's up to the application to apportion them per operation. The SDKs have rate limiting support that can help with this. For example, see https://oracle.github.io/nosql-java-sdk/oracle/nosql/driver/NoSQLHandleConfig.html#setDefaultRateLimitingPercentage(double).
Sets a default percentage of table limits to use. This may be useful for cases where a client should only use a portion of full table limits. This only applies if rate limiting is enabled using setRateLimitingEnabled(boolean).
You could use this method in your case.
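For illustration, a minimal sketch with the Java SDK referenced above, assuming the 2nd operation gets its own handle capped at 50% of the table limits (the endpoint and credentials are placeholders):

```java
import oracle.nosql.driver.NoSQLHandle;
import oracle.nosql.driver.NoSQLHandleConfig;
import oracle.nosql.driver.NoSQLHandleFactory;

public class RateLimitedClient {
    public static NoSQLHandle createSecondOperationHandle() {
        // Handle dedicated to the "2nd operation": cap it at 50% of the
        // table's provisioned limits (20K of the 40K read units above).
        NoSQLHandleConfig config = new NoSQLHandleConfig("https://<service-endpoint>");
        // config.setAuthorizationProvider(...);        // cloud credentials go here
        config.setRateLimitingEnabled(true);            // turn on client-side rate limiting
        config.setDefaultRateLimitingPercentage(50.0);  // use at most 50% of the table limits
        return NoSQLHandleFactory.createNoSQLHandle(config);
    }
}
```

The handle used for the 1st operation would either have no percentage cap or its own 50%, so neither workload can starve the other.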

Application Architecture for scalable hyperledger v1.4 with IOT data

I am working on the Hyperledger Application that can store sensor data from IoT.
Using HLF v1.4 with Raft. Each IoT device will provide JSON data at fixed intervals which gets stored in Hyperledger. I have worked with HLF v1.3 which doesn't scale very well.
With v1.4, I am planning to start with 2 organization setup with 5 peers for each organization.
But the limiting factor seems to be that, as the number of blocks increases with new transactions, querying the network takes longer.
What steps can be taken to scale HLF with v1.4 onwards?
What type of server specs (RAM, CPUs) should be used for good performance when selecting a server, e.g. EC2?
You can change your block size: if you increase the block size, the number of blocks will be reduced. For better query and invoke performance you can limit how much data you store on the blockchain. Computation speed also matters in blockchain; with more compute, TPS may improve. Try instance types like t3.medium or larger, such as t3.large.
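The block size itself is part of the channel configuration (the orderer's batch settings). As a rough sketch of the relevant configtx.yaml keys (values here are illustrative, not recommendations; for an already-running channel they are changed via a channel configuration update transaction):

```yaml
Orderer:
  OrdererType: etcdraft
  BatchTimeout: 2s            # cut a block after this time even if it is not full
  BatchSize:
    MaxMessageCount: 500      # more transactions per block means fewer blocks overall
    AbsoluteMaxBytes: 10 MB   # hard upper bound on the serialized block size
    PreferredMaxBytes: 2 MB   # target block size; a single larger tx can still exceed it
```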

Sequence cache and performance

I have seen the DBA team advise setting the sequence cache to a higher value at the time of performance optimization, e.g. increasing the value from 20 to 1000 or 5000. The Oracle docs say of the cache value:
Specify how many values of the sequence the database preallocates and keeps in memory for faster access.
Somewhere in the AWR report I can see,
select SEQ_MY_SEQU_EMP_ID.nextval from dual
Can any performance improvement be seen if I increase the cache value of SEQ_MY_SEQU_EMP_ID?
My question is:
Does the sequence cache play any significant role in performance? If so, how do I know what cache value is sufficient for a sequence?
We get sequence values from the Oracle cache until they are used up. When all of them have been used, Oracle allocates a new batch of values and updates the data dictionary.
If you need to insert 100,000 records and the cache size is 20, Oracle will update the data dictionary 5,000 times, but only 20 times if you set the cache size to 5,000.
More information that may help you: http://support.esri.com/en/knowledgebase/techarticles/detail/20498
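For illustration, the change itself is a one-line DDL; SEQ_MY_SEQU_EMP_ID is the sequence quoted from the AWR report and 1000 is an example value, not a recommendation:

```sql
-- Preallocate 1000 values in memory instead of the default 20, so the data
-- dictionary is updated once per 1000 values consumed instead of once per 20.
ALTER SEQUENCE SEQ_MY_SEQU_EMP_ID CACHE 1000;
```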
If you omit both CACHE and NOCACHE, then the database caches 20 sequence numbers by default. Oracle recommends using the CACHE setting to enhance performance if you are using sequences in an Oracle Real Application Clusters environment.
Using the CACHE and NOORDER options together results in the best performance for a sequence. If the CACHE option is used without the ORDER option, each instance caches a separate range of numbers and sequence numbers may be assigned out of order by the different instances. So the higher the CACHE value, the fewer writes to the dictionary, but more sequence numbers might be lost. But there is no point in worrying about losing the numbers, since a rollback or shutdown will definitely "lose" a number.
The CACHE option causes each instance to cache its own range of numbers, thus reducing I/O to the Oracle data dictionary, and the NOORDER option eliminates message traffic over the interconnect to coordinate the sequential allocation of numbers across all instances of the database. NOCACHE will be SLOW...
By default, a sequence's cache in Oracle contains 20 values. We can redefine it by giving a CACHE clause in the sequence definition. Giving a CACHE clause mainly helps when you are generating values at a high rate, where it takes less time than normal; otherwise there is no drastic performance increase from declaring a CACHE clause in the sequence definition.
Have done some research and found some relevant information in this regard:
We need to check the database for sequences which are high-usage but defined with the default cache size of 20 - the performance benefits of altering the cache size of such a sequence can be noticeable.
Increasing the cache size of a sequence does not waste space; the cache is still defined by just two numbers, the last used and the high water mark. It is just that the high water mark is jumped by a much larger value every time it is reached.
A cached sequence will return values exactly the same as a non-cached one. However, a sequence cache is kept in the shared pool just as other cached information is. This means it can age out of the shared pool in the same way as a procedure if it is not accessed frequently enough. Everything in the cache is also lost when the instance is shut down.
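A minimal sketch of how you might list candidate sequences still on the default cache (assumes access to DBA_SEQUENCES; which of them are actually high-usage still has to come from AWR or your workload knowledge):

```sql
-- Sequences that still use the default cache size of 20 (or no cache at all).
SELECT sequence_owner, sequence_name, cache_size, last_number
FROM   dba_sequences
WHERE  cache_size <= 20
ORDER BY sequence_owner, sequence_name;
```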
Besides spending more time updating the Oracle data dictionary, having small sequence caches can have other negative effects if you work with a clustered Oracle installation.
In Oracle 10g RAC Grid, Services and Clustering 1st Edition by Murali Vallath it is stated that if you happen to have
an Oracle Cluster (RAC)
a non-partitioned index on a column populated with an increasing sequence value
concurrent multi instance inserts
you can incur high contention on the rightmost index block and experience a lot of Cluster Waits (up to 90% of total insert time).
If you increase the size of the relevant sequence cache you can reduce the impact of Cluster Waits on your index.

How to organize fair cache in Cassandra Cluster for large number of customers using it?

Is there any recommendation on how to serve a large number of customers with Cassandra?
Imagine that all of them read rows (specific per customer) in Cassandra. If we have one big Column Family (CF), we will correspondingly have a single memtable for this CF. So customers who read data more frequently will displace the cache entries of customers who read less frequently, and quality of service (i.e. read speed) will differ between users. This is not fair (all customers should experience the same performance).
Is it normal to allocate a separate CF per customer (e.g. 5000 CFs or more)? As I understand it, this will cause the creation of 5000 memtables, which will lead to fair caching because each customer will be served from a separate cache (memtable). Am I correct?
And on the other hand, will creating a large number of CFs decrease performance compared to having a single big CF?
Memtables are not caches; they exist to ensure writes in Cassandra are sequential on disk. They are read from when doing queries, but they are flushed when they get too big rather than using an eviction policy that is appropriate for a cache.
Having separate column families for each customer will be very inefficient - 1000s of CFs is too many. Better would be to make sure your customer CF remains in cache. If you have enough memory then it will (assuming you don't have other CFs on your Cassandra cluster). Or you could use the row cache and set the size to be big enough to hold all the data in your customer CF.
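As a hedged sketch for more recent Cassandra versions (CQL syntax from Cassandra 2.1+; the keyspace and table names are placeholders, and row_cache_size_in_mb must also be set above zero in cassandra.yaml for the row cache to be available at all):

```sql
-- Serve hot rows of the customer table from the row cache instead of disk.
ALTER TABLE mykeyspace.customer_data
WITH caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'};
```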
