The system consists of tens of peer servers (none of them is a leader/master).
To create an entity, a service must acquire the next sequential number based on a group key; there is a separate sequence for each group key.
Say that to create an instance of entity A, a service has to get a number from the sequence with group key A, while to create entity B it has to get a number from the sequence with group key B.
Getting the same number twice is prohibited. Missing numbers are allowed.
Currently I have implemented a solution with an RDBMS, keeping a record for each group key and updating its current sequence value in a transaction:
UPDATE SEQUENCES SET SEQ_ID=SEQ_ID + 1 WHERE KEY = ?
However, this approach only allows 200-300 queries per second because of locking and synchronisation.
Another approach I am considering is to keep a local buffer of sequence numbers on each node. Once the buffer is empty, the service queries the DB to get the next batch of ids and stores them locally: UPDATE SEQUENCES SET SEQ_ID=SEQ_ID + 1000 WHERE KEY = ? if the batch size is 1000. This should lower contention. However, if a node goes down it loses all of its acquired sequence numbers, which, if it happens frequently, can lead to overflowing the maximum value of the sequence (e.g. max int).
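A rough sketch of what that per-node buffer could look like, assuming a DB-API style connection and the SEQUENCES table from the query above (the class, parameter style, and column names are illustrative, not a finished implementation):

```python
import threading

class SequenceBuffer:
    """Hands out ids from a locally reserved block and fetches a new block
    from the SEQUENCES table only when the buffer runs out."""

    def __init__(self, conn, group_key, batch_size=1000):
        self.conn = conn            # any DB-API connection (assumed)
        self.group_key = group_key
        self.batch_size = batch_size
        self.next_id = 0            # next value to hand out
        self.limit = 0              # exclusive upper bound of the reserved block
        self.lock = threading.Lock()

    def _reserve_block(self):
        cur = self.conn.cursor()
        # Reserve a whole block; the row lock taken by the UPDATE keeps
        # concurrent nodes from reading the same range before we commit.
        cur.execute("UPDATE SEQUENCES SET SEQ_ID = SEQ_ID + %s WHERE KEY = %s",
                    (self.batch_size, self.group_key))
        cur.execute("SELECT SEQ_ID FROM SEQUENCES WHERE KEY = %s",
                    (self.group_key,))
        upper = cur.fetchone()[0]
        self.conn.commit()
        self.limit = upper
        self.next_id = upper - self.batch_size

    def next(self):
        with self.lock:
            if self.next_id >= self.limit:
                self._reserve_block()
            value = self.next_id
            self.next_id += 1
            return value
```

The drawback described above still applies: any ids left in the buffer when the node dies are lost for good.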
I don't know in advance how many sequence numbers will be needed.
I don't want to introduce additional dependencies between servers, with one of them generating sequence numbers and serving them to the others.
What are the general ways to solve similar problems?
Which other RDBMS-based approaches can be considered?
Which other non-RDBMS-based approaches can be considered?
What other problems can happen with the local buffer solution?
I just read that the maximum parallelism (defined by setMaxParallelism) of a Flink job cannot be changed without losing state. This surprised me a bit, and it is not that hard to imagine a scenario where one starts running a job, only to find out the load is eventually 10x larger than expected (or the efficiency of the code is below expectations), resulting in a desire to increase parallelism.
I could not find many reasons for this, other than some references to key groups. The most tangible statement I found is this:
The max parallelism mustn't change when scaling the job, because it would destroy the mapping of keys to key groups.
However, this still leaves me with the questions:
Why is it hard/impossible to let a job change its max parallelism?
Based on the above, the following conceptual solution came to mind:
In the state, keep track of the last used max parallelism
When starting a job, indicate the desired max parallelism
Given that both settings are known, it should be possible to infer how the mappings would need to change to remain valid initially.
If needed, a new state could be defined based on the old state with the new max parallelism, to 'fit' the new job.
I am not saying this conceptual solution is ideal, or that it would be trivial to implement. I just wonder if there is more to the very rigid nature of the maximum parallelism, and I am trying to understand whether it is just a matter of 'this flexibility is not implemented yet' or 'this goes so much against the nature of Flink that one should not want it'.
Every key is assigned to exactly one key group by computing a hash of the key modulo the number of key groups. So changing the number of key groups affects the assignment of keys to key groups. Each task manager is responsible for one or more key groups, so the number of key groups is the same as the maximum parallelism.
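A conceptual illustration of that mapping (plain pseudocode, not Flink's actual implementation; 128 is used only because it is the default mentioned below):

```python
MAX_PARALLELISM = 128   # == number of key groups; Flink's default

def key_group_of(key):
    # A key's group depends only on the key and on MAX_PARALLELISM.
    return hash(key) % MAX_PARALLELISM

def operator_index_of(key_group, parallelism):
    # Each parallel instance owns a contiguous range of key groups, so
    # rescaling the job only re-splits those ranges. Changing
    # MAX_PARALLELISM, however, changes key_group_of() itself and
    # invalidates the key-group assignment recorded in snapshots.
    return key_group * parallelism // MAX_PARALLELISM
```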
The reason this number is painful to change is that it is effectively baked into the state snapshots (checkpoints and savepoints). These snapshots are indexed by key group, so that on system start-up, each task manager can efficiently load just the state they require.
There are in-memory data structures that scale up significantly as the number of key groups rises, which is why the max parallelism doesn't default to some rather large value (the default is 128).
The State Processor API can be used to rewrite state snapshots, should you need to change the number of key groups, or migrate between state backends.
I am using Google Datastore and will need to query it to retrieve some entities. These entities will need to be sorted from newest to oldest. My first thought was to have a date_created property which contains a timestamp. I would then index this field and sort on it. The problem with this approach is that it will cause hotspots in the database (https://cloud.google.com/datastore/docs/best-practices).
Do not index properties with monotonically increasing values (such as a NOW() timestamp). Maintaining such an index could lead to hotspots that impact Cloud Datastore latency for applications with high read and write rates.
Obviously sorting data on dates is probably the most common sorting performed on a database. If I can't index timestamps, is there another way I can sort my queries from newest to oldest without hotspots?
As you note, indexing monotonically increasing values doesn't scale and can lead to hotspots. Whether you are potentially impacted by this depends on your particular usage.
As a general rule, the hotspotting point of this pattern is 500 writes per second. If you know you're definitely going to stay under that, you probably don't need to worry.
If you do need more than 500 writes per second, but have an upper limit in mind, you could attempt a sharded approach. Basically, if your upper bound on writes per second is x, then n = ceiling(x/500), where n is the number of shards. When you write your timestamp, prepend random(1, n) at the start. This creates n random key ranges that can each sustain up to 500 writes per second. When you query your data, you'll need to issue n queries and do some client-side merging of the result streams.
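A rough sketch of that write/query pattern, independent of any particular Datastore client library (NUM_SHARDS, run_query, and the field names are placeholders):

```python
import heapq
import random

NUM_SHARDS = 4   # n = ceil(expected writes per second / 500)

def sharded_sort_key(timestamp_ms):
    """Prepend a random shard id so writes spread over n key ranges."""
    shard = random.randint(1, NUM_SHARDS)
    return f"{shard}|{timestamp_ms:020d}"

def newest_first(run_query):
    """run_query(shard) is assumed to return that shard's entities already
    ordered newest-to-oldest; the n streams are merged client side."""
    streams = [run_query(shard) for shard in range(1, NUM_SHARDS + 1)]
    return heapq.merge(*streams,
                       key=lambda e: e["timestamp_ms"],
                       reverse=True)
```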
We are using a timestamp to ensure that entries in a log table are recorded sequentially, but we have found a potential flaw. Say, for example, we have two nodes in our RAC and the node timestamps are 1000ms off. Our app server inserts two log entries within 30ms of each other. The first insert is serviced by Node1 and the second by Node2. With a 1000ms difference between the two nodes, the timestamps could show the log entries occurring in the wrong order! (I would just use a sequence, but our sequences are cached for performance reasons...)
NTP sync doesn't help this situation because NTP has a fault tolerance of 128ms -- which leaves the door open for records to be recorded out of order when they occur more frequently than that.
I have a feeling I'm looking at this problem the wrong way. My ultimate goal is to be able to retrieve the actual sequence that log entries are recorded. It doesn't have to be by a timestamp column.
An Oracle sequence with ORDER specified is guaranteed to return numbers in order across a RAC cluster. So
create sequence my_seq
start with 1
increment by 1
order;
Now, in order to do this, that means that you're going to be doing a fair amount of inter-node communication in order to ensure that access to the sequence is serialized appropriately. That's going to make this significantly more expensive than a normal sequence. If you need to guarantee order, though, it's probably the most efficient approach that you're going to have.
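For illustration, the sequence would then become the ordering column instead of the timestamp; a minimal sketch with a hypothetical app_log table and an Oracle DB-API style cursor:

```python
def write_log_entry(cursor, message):
    # The ordered sequence, not the clock, decides the position of the entry.
    cursor.execute(
        "INSERT INTO app_log (log_seq, message, created_at) "
        "VALUES (my_seq.NEXTVAL, :msg, SYSTIMESTAMP)",
        {"msg": message},
    )

def read_log_in_order(cursor):
    cursor.execute("SELECT log_seq, message FROM app_log ORDER BY log_seq")
    return cursor.fetchall()
```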
Bear in mind that a timestamp attached to a row is generated at the time of the insert or update, but the actual change to the database takes place when the commit happens: depending on the complexity of the transactions, row 1 might get inserted before row 2, but get committed after it.
The only thing I am aware of in Oracle that guarantees order across the nodes is the SCN that Oracle attaches to the transaction, by which transactions in a RAC environment can be ordered for things like Streams replication.
1000ms? That is one second, isn't it? IMHO that is a lot. If you really need precise time, then simply give up the idea of global time. Generate timestamps on the log server and assume that each log server has its own local time. Read something about Lamport's time if you need some theory. But maybe the source of your problem is somewhere else: RAC synchronises time between nodes, and it would log a discrepancy that big.
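If it helps, a bare-bones sketch of Lamport's logical time (purely illustrative, not tied to Oracle or any logging framework):

```python
import threading

class LamportClock:
    """Events are ordered by a counter that only moves forward,
    independent of wall-clock skew between nodes."""

    def __init__(self):
        self._time = 0
        self._lock = threading.Lock()

    def tick(self):
        # Local event, e.g. writing a log entry on this node.
        with self._lock:
            self._time += 1
            return self._time

    def observe(self, remote_time):
        # Merging in a timestamp received from another node.
        with self._lock:
            self._time = max(self._time, remote_time) + 1
            return self._time
```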
If two consecutive events are logged by two different connections, is the same thread using both connections? Or are those events passed to background threads which then write into the database? In other words, is the logging sequential or parallel?
My question is about sharded counters and whether you can have too many of them. Note that the example below is just made up.
Say you want to keep a hit count for the different pages on your site. To prevent datastore contention you decide to shard the hit counter for each page. As the number of pages grows, the number of sharded counters grows with it.
Assuming you follow the typical sharded counter examples, each sharded counter has its own kind, allowing a query to be built that retrieves all entities belonging to that kind, i.e. all entities belonging to that particular sharded counter.
My questions are:
Will a large number of counters (not shards per counter) affect performance, as there will be so many entity kinds?
Is this best practice? It looks ugly in the datastore viewer when you have loads of entity kinds, as each kind is a sharded counter for a page on your site.
If the above is not good, what would be a better solution?
If you followed what you call the "typical shard counter" examples, you can see that there's only one counter type, but you can create different string keys to count different things.
That way you have only one ShardCounter kind in your db, but many, many instances of it with different string keys.
We have a system similar to what you've described. Using only one type of counter we count more than a hundred event types, summing up to around a million hits a day. So it's safe to assume that it's pretty scalable ;)
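A rough sketch of the single-kind, many-string-keys idea (the store object and its atomic increment are placeholders, not a specific App Engine API):

```python
import random

NUM_SHARDS = 20   # shards per counter; tune to the expected write rate

def shard_key(counter_name, index):
    # One kind, many string keys: "<counter name>-<shard index>".
    return f"{counter_name}-{index}"

def increment(store, counter_name):
    # Picking a random shard spreads write contention over NUM_SHARDS entities.
    key = shard_key(counter_name, random.randint(0, NUM_SHARDS - 1))
    store.increment(key)          # assumed to be atomic (e.g. a transaction)

def total(store, counter_name):
    # Reading a counter means summing all of its shards.
    return sum(store.get(shard_key(counter_name, i)) or 0
               for i in range(NUM_SHARDS))
```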
EDIT: added counter code examples from Google's documentation:
In the last example you will see a counter that has a SHARD_KEY_TEMPLATE variable at the top of the code. This last example allows having different counters with the same shard class.
https://cloud.google.com/appengine/articles/sharding_counters?hl=en
CREATE SEQUENCE S1
START WITH 100
INCREMENT BY 10
CACHE 10000000000000000000000000000000000000000000000000000000000000000000000000
If I fire a query with such a big cache size, even if it creates the sequence S1, what is the maximum cache size that I can provide?
http://download.oracle.com/docs/cd/B28359_01/server.111/b28286/statements_6015.htm#SQLRF01314
Quote from 11g docs ...
Specify how many values of the sequence the database preallocates and keeps in memory for faster access. This integer value can have 28 or fewer digits. The minimum value for this parameter is 2. For sequences that cycle, this value must be less than the number of values in the cycle. You cannot cache more values than will fit in a given cycle of sequence numbers. Therefore, the maximum value allowed for CACHE must be less than the value determined by the following formula:
(CEIL (MAXVALUE - MINVALUE)) / ABS (INCREMENT)
If a system failure occurs, then all cached sequence values that have not been used in committed DML statements are lost. The potential number of lost values is equal to the value of the CACHE parameter.
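To make the formula concrete for the S1 example above (assuming the defaults for an ascending sequence, MINVALUE 1 and a 28-nines MAXVALUE):

```python
# Back-of-the-envelope check for S1: START WITH 100, INCREMENT BY 10.
MAXVALUE = 10**28 - 1   # assumed default MAXVALUE for an ascending sequence
MINVALUE = 1            # assumed default MINVALUE
INCREMENT = 10

cache_upper_bound = (MAXVALUE - MINVALUE) // abs(INCREMENT)
print(cache_upper_bound)   # roughly 1e27; CACHE must be strictly less than this

# Independently of the formula, the quoted docs also limit the CACHE literal
# itself to 28 or fewer digits, so the value shown in the question is far
# outside that range as well.
```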
Determining the optimal value is a matter of determining the rate at which you will generate new values, and thus the frequency with which recursive SQL will have to be executed to update the sequence record in the data dictionary. Typically it's higher for RAC systems to avoid contention, but then they are also generally busier as well. Performance problems relating to insufficient sequence cache are generally easy to spot through AWR/Statspack and other diagnostic tools.
Looking in the Oracle API, I don't see a maximum cache size specified (Reference).
Here are some guidelines on setting an optimal cache size.