Cassandra - What really happens once the key cache gets filled - caching

Suppose I have configured 1 MB of key cache (assume it can hold 13,000 keys).
Then I wrote some records to a column family (say 20,000).
Then I read them for the first time (all keys sequentially, in the same order they were written), and keys started to be stored in the key cache.
When the read reached key #13,000, the key cache was completely full.
What happens to the key cache when the next keys are read? (Which key is removed to make room for the newly read key?)
Does the key cache follow FIFO, LIFO, or random eviction?

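The key cache uses ConcurrentLinkedHashMap underneath, and hence its eviction policy is LRU (least recently used). So in your scenario, where every key has been read exactly once and in order, the key evicted to make room for key #13,001 should be the one read longest ago, i.e. the first key you read.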
https://code.google.com/p/concurrentlinkedhashmap/#Features
https://code.google.com/p/concurrentlinkedhashmap/wiki/Design#Beyond_LRU
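To make the LRU behaviour concrete, here is a minimal sketch in Python. It is a toy model built on OrderedDict, not Cassandra's actual key cache; the class name LRUKeyCache, the helper read_sequentially and the cached values are made up for illustration. It replays the scenario from the question: a cache that holds 13,000 keys and 20,000 keys read once each, in order.

```python
from collections import OrderedDict


class LRUKeyCache:
    """Toy LRU cache; a stand-in for illustration, not Cassandra's key cache."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()  # least recently used first, most recent last

    def get(self, key):
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # a hit makes the key most recently used
        return self._entries[key]

    def put(self, key, value):
        if key in self._entries:
            self._entries.move_to_end(key)
        self._entries[key] = value
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict the least recently used key

    def keys(self):
        return list(self._entries)


def read_sequentially(cache, n_keys):
    """Read keys 1..n_keys once each, caching every miss (as in the question)."""
    for key in range(1, n_keys + 1):
        if cache.get(key) is None:
            cache.put(key, f"position-of-{key}")


cache = LRUKeyCache(capacity=13000)
read_sequentially(cache, 20000)
oldest, newest = cache.keys()[0], cache.keys()[-1]
```

With a purely sequential first read, LRU evicts in the same order as FIFO would: keys 1 to 7,000 are gone and keys 7,001 to 20,000 remain. The two policies only diverge once some keys are read again, because a hit moves the key back to the "recent" end.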

Related

Is it better to use a txt file to get the current counter value instead of a database?

I am working on a website in Laravel, where I load the current counter value from the database. The user then clicks a button to increase the score.
But as the website has around 4000 concurrent users at any given time, the database connections are taking their toll on the server and resulting in timeouts.
If I load the current score from a txt file and then write it back to the same file, will that be better?
Or should I use an application variable to store the score?
I have tried using the cache, but it doesn't pull the latest value. Database optimization is also not working due to the number of users.
I am looking for the best way to show and increment a counter without using the database.
A database would do a better job. A NoSQL database is perfect for your use case. You can use Redis; it stores data in memory (RAM), which means read and write operations will be much faster than in databases that operate on secondary storage (hard drive).
Redis itself supports incrementing values with the INCR command. INCR increments the number stored at a key by one. If the key does not exist, it is set to 0 before performing the operation.
For example, suppose the key that holds the value is my_counter. You can play around with redis-cli like so:
redis> SET my_counter "10"
"OK"
redis> INCR my_counter
(integer) 11
redis> GET my_counter
"11"
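The reason INCR fits here is that the read, the add and the write happen as one step, whereas "load the score from a file, then write it back" lets two concurrent users overwrite each other's update. A rough sketch of that idea in Python follows. It is an in-memory stand-in, not a Redis client; CounterStore and its methods are hypothetical names, and a lock plays the role of Redis's atomic command execution.

```python
import threading


class CounterStore:
    """In-memory stand-in for a key-value store with an INCR-like operation."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def incr(self, key):
        # Read, add one, and write back as a single step under a lock,
        # so concurrent callers never overwrite each other's update.
        with self._lock:
            value = int(self._data.get(key, 0)) + 1  # a missing key counts as 0
            self._data[key] = str(value)
            return value

    def set(self, key, value):
        with self._lock:
            self._data[key] = str(value)

    def get(self, key):
        with self._lock:
            return self._data.get(key)


store = CounterStore()
store.set("my_counter", "10")
first = store.incr("my_counter")  # same steps as the redis-cli session above


def click(times):
    for _ in range(times):
        store.incr("my_counter")


# Eight "users" clicking 1000 times each, all at once.
threads = [threading.Thread(target=click, args=(1000,)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
final = store.get("my_counter")
```

No click is lost: the counter ends at 10 + 1 + 8000. With a real Redis server you would call the client's increment operation instead of keeping this state in your own process.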
Fortunately, there is a Redis client for Laravel. You can have a read here:
https://laravel.com/docs/5.8/redis
Good luck :)
Edit 1:
If a high number of users is causing the server to slow down, you have other server and architectural options that can be applied alongside a new database, such as horizontal and vertical scaling.
References:
https://github.com/phpredis/phpredis

Downside of many caches in Spring

Due to the limitation of not being able to evict entries based on a partial key, I am thinking of a workaround using the cache name as my partial key and evicting all (there would only be one) entries in the cache. For example, let's say there are 2 key-value pairs like so:
"123#name1" -> value1,
"124#name2" -> value2
Ideally, at the time of eviction, I would like to remove all keys that contain the string "123". However, as this is not supported, the workaround I'm thinking of is to have the following:
"123" cache: "name1" -> value1
"124" cache: "name2" -> value2
Then at eviction, I would simply specify to remove all keys in the "123" cache.
The downside of this of course is that there would be a lot of different caches. Is there any performance penalty to this?
From reading this, it seems Redis at least only uses the cache name as a prefix. So it is not creating multiple separate caches underneath it. But I would like to verify my understanding.
I am also looking to use Redis as my underlying cache provider if that helps.
You can use a few approaches to overcome this:
Use grouped data structures like sets, sorted sets, and hashes: each of them supports a very large number of member elements, so you can use them to store your cache items and do the relevant lookups. However, do check the performance difference (it should be very small) between this kind of lookup and a direct key-value lookup.
When you want to evict a group of cache keys of the same type, you just remove that data structure's key from Redis.
Use Redis database numbers: you would need to edit redis.conf to increase the maximum number of Redis databases. Redis databases are just numbers that provide namespaces in which your key-values live. To group similar items, you would put them in the same database number and empty that database with a single command whenever you want to flush that group of keys.
The caveat here is that, although you would be able to use the same Redis connection, you would have to switch databases with the Redis SELECT command.
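The first approach (one hash per group, evicted by deleting the hash key) can be sketched as follows. This is an in-memory imitation in Python rather than real Redis calls; GroupedCache is a hypothetical class whose methods are named after the Redis commands HSET, HGET and DEL that they mimic.

```python
class GroupedCache:
    """In-memory imitation of grouping cache entries under one hash key."""

    def __init__(self):
        self._groups = {}  # group key -> {field: value}

    def hset(self, group, field, value):
        self._groups.setdefault(group, {})[field] = value

    def hget(self, group, field):
        return self._groups.get(group, {}).get(field)

    def delete(self, group):
        # Dropping the group key removes every entry stored under it.
        return 1 if self._groups.pop(group, None) is not None else 0


cache = GroupedCache()
cache.hset("123", "name1", "value1")
cache.hset("123", "name3", "value3")
cache.hset("124", "name2", "value2")
removed = cache.delete("123")  # evicts name1 and name3 together
```

After the delete, everything under "123" is gone while "124" is untouched, which is the partial-key eviction the question was after.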

How to keep a cache up to date

When memcached or Redis is used as a cache in front of a data store, how is the cache updated when a value changes?
For example, if I read key1 from the cache the first time and it misses, I pull value1 and put key1=value1 into the cache.
After that, suppose the value of key1 changes to value2.
How is the value in the cache updated or invalidated?
Does that mean that whenever key1's value changes, either the application or the database needs to check whether key1 is in the cache and update it?
Since you are using a cache, you have to tolerate data inconsistency, i.e. at some points in time the data in the cache differs from the data in the database.
You don't need to update the value in the cache every time the value changes. Otherwise, the whole cache system becomes very complicated (e.g. you have to maintain a list of keys that have been cached), and it might also be unnecessary (e.g. the key-value pair might be used only once, so there is no need to update it any more).
How can we update the data in the cache and keep the cache system simple?
Normally, besides setting or updating a key-value pair in the cache, we also set a TIMEOUT for each key. After that, clients can get the key-value pair from the cache. However, once a key reaches its timeout, the cache system removes the key-value pair from the cache. This is called KEY EXPIRATION. The next time a client tries to get that key from the cache, it gets nothing. This is called a CACHE MISS. In this case, the client has to get the key-value pair from the database and put it into the cache with a new timeout.
If the data has been updated in the database while the key has NOT yet expired in the cache, clients will get inconsistent data. However, once the key has expired, its value will be retrieved from the database and inserted into the cache by some client. After that, other clients will get the updated data until the data changes again.
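The flow described above (timeout per key, expiration, cache miss, reload from the database) can be sketched like this in Python. It is a minimal in-memory model, not memcached or Redis; TTLCache, read_through and the fake clock are hypothetical names used only for this example, and a plain dict stands in for the database.

```python
import time


class TTLCache:
    """Minimal cache where every key carries an expiry time."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}  # key -> (value, expires_at)

    def set(self, key, value, ttl_seconds):
        self._entries[key] = (value, self._clock() + ttl_seconds)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None  # cache miss
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # the key has expired
            return None
        return value


def read_through(cache, database, key, ttl_seconds):
    """Return the cached value, or load it from the database on a miss."""
    value = cache.get(key)
    if value is None:
        value = database[key]
        cache.set(key, value, ttl_seconds)
    return value


now = [0.0]  # fake clock so the example does not need to sleep
database = {"key1": "value1"}
cache = TTLCache(clock=lambda: now[0])

first = read_through(cache, database, "key1", ttl_seconds=60)
database["key1"] = "value2"  # the database changes while the key is still cached
stale = read_through(cache, database, "key1", ttl_seconds=60)
now[0] = 61.0  # later, the key has passed its timeout
fresh = read_through(cache, database, "key1", ttl_seconds=60)
```

Here first and stale are both "value1" (the inconsistency window), and fresh is "value2" because the expired key forced a reload from the database.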
How to set the timeout?
Normally, there're two kinds of expiration policy:
Expire in N seconds/minutes/hours...
Expire at some future timepoint, e.g. expire at 2017/7/30 00:00:00
A large timeout greatly reduces the load on the database, but the data might be out of date for a long time. A small timeout keeps the data as up to date as possible, but puts a heavy load on the database. So you have to balance this trade-off when choosing the timeout.
How does Redis expire keys?
Redis has two ways to expire keys:
When a client tries to operate on a key, Redis checks whether the key has reached its timeout. If it has, Redis removes the key and acts as if the key doesn't exist. In this way, Redis ensures that clients don't get expired data.
Redis also runs a periodic expiration task that samples keys with a timeout at a configured frequency. If the sampled keys have reached their timeout, Redis removes them. In this way, Redis reclaims expired keys even if no client ever touches them again.
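As a rough illustration of those two mechanisms, here is a small Python model. It is not Redis's implementation; ExpiringStore, active_expire_cycle and the sample size are assumptions chosen for the example. get shows expiration on access, and active_expire_cycle shows one pass of the periodic sampling.

```python
import random


class ExpiringStore:
    """Toy key-value store with passive and active key expiration."""

    def __init__(self, clock):
        self._clock = clock
        self._data = {}
        self._expires = {}  # key -> expiry time, only for keys with a timeout

    def set(self, key, value, ttl):
        self._data[key] = value
        self._expires[key] = self._clock() + ttl

    def _expired(self, key):
        return key in self._expires and self._clock() >= self._expires[key]

    def _remove(self, key):
        self._data.pop(key, None)
        self._expires.pop(key, None)

    def get(self, key):
        # Passive expiration: check the timeout when a client touches the key.
        if self._expired(key):
            self._remove(key)
        return self._data.get(key)

    def active_expire_cycle(self, sample_size=20):
        # Active expiration: test a random sample of keys that have a timeout.
        candidates = list(self._expires)
        sample = random.sample(candidates, min(sample_size, len(candidates)))
        removed = 0
        for key in sample:
            if self._expired(key):
                self._remove(key)
                removed += 1
        return removed

    def size(self):
        return len(self._data)


now = [0]  # fake clock
store = ExpiringStore(clock=lambda: now[0])
for i in range(10):
    store.set(f"k{i}", i, ttl=5)
store.set("touched", "x", ttl=5)

now[0] = 10                           # every key is now past its timeout
passive_result = store.get("touched")  # removed on access
removed = store.active_expire_cycle(sample_size=10)
```

The key that a client touches is dropped immediately, while the other ten are only reclaimed when the sampling pass happens to pick them.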
You can simply delete the cached value in the API function where that value is inserted or updated. The server will then fetch the updated value on the next request, because the cached value has already been removed.
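A minimal sketch of that invalidate-on-write idea in Python, with plain dicts standing in for both the database and the cache (Repository and its methods are hypothetical names, not part of any library):

```python
class Repository:
    """Reads go through the cache; writes update the database and drop the cached copy."""

    def __init__(self, database, cache):
        self.database = database
        self.cache = cache
        self.db_reads = 0

    def read(self, key):
        if key in self.cache:
            return self.cache[key]
        self.db_reads += 1
        value = self.database[key]
        self.cache[key] = value
        return value

    def write(self, key, value):
        self.database[key] = value
        self.cache.pop(key, None)  # empty the cached value so the next read reloads it


repo = Repository(database={"key1": "value1"}, cache={})
before = repo.read("key1")
repo.write("key1", "value2")
after = repo.read("key1")
```

The read after the write misses the cache and returns "value2". This only works if every write path for that value goes through the same function.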
I had a similar issue with stale data, especially in two cases:
When I get bulk messages/events
In my use case, I write a score to the Redis cache and read it again in a subsequent call. With bulk messages, because of weak consistency in Redis, the data might not yet have been replicated to all replicas when I read the same key again (the second read generally comes within 1-2 ms).
Remediation:
In this case I was getting stale data. To address that, I used a cache on top of the cache, i.e. a loading TTL cache in front of the Redis cache. A read checks the loading cache first and, if the data is not present, checks the Redis cache. Once done, both caches are updated.
In a distributed system (k8s) where I have multiple pods
(Kafka is used as the message broker)
With the above strategy we hit another problem: what if data for a key previously served by, say, pod1 reaches pod2? This has a bigger impact, as it leads to data inconsistencies.
Remediation:
Here the Kafka partition key was set to the same key that is set in Redis. This way, subsequent messages for a given key go to one particular pod only. If pods restart, the cache is built again.
This solved our problem.
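The reason the partition key helps can be sketched in a few lines of Python. This is only a model: it uses CRC32 as a stand-in for the broker's real hash function, assumes one partition per pod, and the pod names, events and helper functions are invented for the example.

```python
import zlib


def partition_for(key, num_partitions):
    """Map a key to a partition with a stable hash (CRC32 here, as a stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def owner_pod(key, pods):
    # Assumes one partition per pod, so the partition index picks the pod.
    return pods[partition_for(key, len(pods))]


pods = ["pod1", "pod2", "pod3"]
events = [("user:42", 10), ("user:7", 3), ("user:42", 12), ("user:42", 15)]

local_caches = {pod: {} for pod in pods}
for key, score in events:
    local_caches[owner_pod(key, pods)][key] = score  # each key has exactly one owner

owners_of_user_42 = [pod for pod, cache in local_caches.items() if "user:42" in cache]
```

Because the same key always hashes to the same partition, only one pod ever holds a local copy of "user:42", so its in-process cache cannot disagree with another pod's copy.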

Count redis keys without fetching them in Ruby

I'm keeping a list of online users in Redis, with one key corresponding to one user. Keys are set to time out after 15 minutes, so to see roughly how many users have been active in the past 15 minutes, all I have to do is:
redisCli.keys('user:*').count
The problem is as the number of keys grows, the time it takes to fetch all the keys before counting them is increasing noticeably. Is there a way to count the keys without actually having to fetch all of them first?
There is an alternative to directly indexing keys in a Set or Sorted Set, which is to use the new SCAN command. It depends on the use case, memory / speed tradeoff, and required precision of the count.
Another alternative is that you use Redis HyperLogLogs, see PFADD and PFCOUNT.
Redis does not have an API for only counting keys with a specific pattern, so it is also not available in the ruby client.
What I can suggest is to keep another data structure to read the number of users from.
For instance, you can use redis's SortedSet, where you can keep each user with the timestamp of its last TTL set as the score, then you can call zcount to get the current number of active users:
redisCli.zcount('active_users', 15.minutes.ago.to_i, Time.now.to_i)
From time to time you will need to clean up the old values by:
redisCli.zremrangebyscore 'active_users', 0, 15.minutes.ago.to_i
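To show what those two calls do to the data, here is an in-memory imitation in Python. ActiveUsers is a hypothetical class whose methods are named after the Redis commands they mimic; a real sorted set keeps members ordered by score, so ZCOUNT does not have to walk every member the way this dict-based sketch does.

```python
class ActiveUsers:
    """In-memory imitation of a sorted set: member -> score (last-seen timestamp)."""

    def __init__(self):
        self._scores = {}

    def zadd(self, member, score):
        self._scores[member] = score  # re-adding a member just updates its score

    def zcount(self, min_score, max_score):
        return sum(1 for s in self._scores.values() if min_score <= s <= max_score)

    def zremrangebyscore(self, min_score, max_score):
        stale = [m for m, s in self._scores.items() if min_score <= s <= max_score]
        for member in stale:
            del self._scores[member]
        return len(stale)


WINDOW = 15 * 60  # seconds
now = 10_000      # fixed "current time" so the example is deterministic

users = ActiveUsers()
users.zadd("user:1", now - 60)       # seen one minute ago
users.zadd("user:2", now - 20 * 60)  # seen twenty minutes ago
users.zadd("user:3", now - 5 * 60)   # seen five minutes ago
users.zadd("user:1", now)            # user:1 is seen again

active = users.zcount(now - WINDOW, now)
cleaned = users.zremrangebyscore(0, now - WINDOW)
```

Two users fall inside the 15-minute window, and the cleanup removes the one entry that is older than that. The count never needs the member names, which is what avoids fetching all the keys.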

Does Firebird defrag? If so, like a clustered index?

I've seen a few (literally, only a few) links, and nothing in the documentation, that talk about clustering with Firebird and say that it can be done.
Then, I shot for the moon on this question CLUSTER command for Firebird?, but answerer told me that Firebird doesn't even have clustered indexes at all, so now I'm really confused.
Does Firebird physically order data at all? If so, can it be ordered by any key, not just primary, and can the clustering/defragging be turned on and off so that it only does it during downtime?
If not, isn't this a hit to performance since it will take the disk longer to put together disparate rows that naturally should be right next to each other?
(DB noob)
MVCC
I found out that Firebird is based upon MVCC, so old data actually isn't overwritten until a "sweep". I like that a lot!
Again, I can't find much, but it seems like a real shame that data wouldn't be defragged according to a key.
This says that database pages are defragmented but provides no further explanation.
Firebird does not cluster records. It was designed to avoid the problems that require clustering and the fragmentation problems that come with clustered indexes. Indexes and data are stored separately, on different types of pages. Each data page contains data from only one table. Records are stored in the order they were inserted, give or take concurrent inserts, which generally go on separate pages. When old records are removed, new records will be stored in their place, so new records sometimes appear on the same page as older ones.
Many tables use an artificial primary key, generally ascending, which might be a database generated sequence or a timestamp. That practice causes records to be stored in key order, but that order is by no means guaranteed. Nor is it very interesting. When the primary key is artificial, most queries that return groups of related records are done on secondary indexes. That's a performance hit for records that are clustered because look-ups on secondary indexes require traversing two indexes because the secondary index provides only the key to the primary index, which must be traversed to find the data.
On the larger issue of defragmentation and space usage, Firebird tracks the free space on pages so new records will be inserted on pages that have had records removed. If a page becomes completely empty, it will be reallocated. This space management is done as the database runs.
As you know, Firebird uses Multi-Version Concurrency Control, so when a record is updated or deleted, Firebird creates a new record version, but keeps the old version around. When all transactions that were running before the change was committed have ended, the old record version no longer serves any purpose, and Firebird will remove it. In many applications, old versions are removed in the normal course of running the database. When a transaction touches a record with old versions, Firebird checks the state of the old versions and removes them if no running transaction can read them.
There is a function called "Sweep" that systematically removes unneeded old record versions. Sweep can run concurrently with other database activity, though it's better to schedule it when the database load is low. So no, it's not true that nothing is removed until you run a sweep.
Best regards,
Ann Harrison
who's worked with Firebird and its predecessors for an embarrassingly long time
BTW - as the first person to answer mentioned, Firebird does leave space on pages so that the old version of a record stays on the same page as the newer version. It's not a fixed percentage of the space, but 16 bytes per record stored on the page, so pages of tables with very short records have more free space and tables that have long records have less.
On restore, database pages are created ~70% full (as I recall, unless you specify gbak's -use_all_space switch) and the restore is done one table at a time, writing pages to the end of the database file as needed. You can imagine a scenario where the pages are condensed down to far fewer, which brings the data together and effectively "defrags" it.
As far as controlling the physical grouping on disk or doing an online defrag -- in Firebird there is none. Remember that just because you need to access a page does not mean your disk does a read -- file system and database cache can avoid it!
