Putting cache entries on a specific Ignite server

I have an Ignite data grid of five servers (say A, B, C, D and E). A partitioned cache has been distributed across these five servers with the number of backups set to 1.
I want to store 100 million entries in this partitioned cache. But I want to control how my cache entries are partitioned across the Ignite servers.
Is it possible to direct my Ignite client to put a cache entry on a particular server (say E)?

The only way to do this is to implement your own affinity function instead of the ones provided out of the box. However, I would encourage you to rethink this approach, because it's not scalable. The affinity functions included in Ignite are designed to provide even distribution across any set of nodes, so you can dynamically scale up and down whenever you need to. Your approach is much less flexible.
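For completeness, here is a rough sketch of what such a function could look like. This is an assumption-laden illustration, not production code: the class name, the "pinned:" key convention, and the PINNED_NODE user attribute are all made up, and a real implementation would also have to keep backup assignments consistent across topology changes.

```java
// Sketch: pin keys prefixed "pinned:" to one partition, and assign that
// partition's primary copy to the node started with the (made-up) user
// attribute PINNED_NODE=true. Everything else is delegated to the stock
// rendezvous function for even distribution.
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

import org.apache.ignite.cache.affinity.AffinityFunction;
import org.apache.ignite.cache.affinity.AffinityFunctionContext;
import org.apache.ignite.cache.affinity.rendezvous.RendezvousAffinityFunction;
import org.apache.ignite.cluster.ClusterNode;

public class PinnedAffinityFunction implements AffinityFunction {
    private final RendezvousAffinityFunction delegate = new RendezvousAffinityFunction();

    @Override public void reset() { delegate.reset(); }

    @Override public int partitions() { return delegate.partitions(); }

    @Override public int partition(Object key) {
        // Route "special" keys to partition 0; everything else as usual.
        if (key instanceof String && ((String)key).startsWith("pinned:"))
            return 0;
        return delegate.partition(key);
    }

    @Override public List<List<ClusterNode>> assignPartitions(AffinityFunctionContext ctx) {
        List<List<ClusterNode>> assignment = new ArrayList<>(delegate.assignPartitions(ctx));
        for (ClusterNode node : ctx.currentTopologySnapshot()) {
            if (Boolean.TRUE.equals(node.attribute("PINNED_NODE"))) {
                List<ClusterNode> owners = new ArrayList<>(assignment.get(0));
                owners.set(0, node); // make this node primary for partition 0
                assignment.set(0, owners);
                break;
            }
        }
        return assignment;
    }

    @Override public void removeNode(UUID nodeId) { delegate.removeNode(nodeId); }
}
```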
Also, I would recommend going through the documentation page about affinity collocation. Very likely this will give you hints on how to implement your logic in a better way.
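For reference, affinity collocation usually looks like the sketch below: instead of forcing entries onto a named server, you mark which field of the key should drive placement, so related entries land on the same node. The OrderKey class and its fields are illustrative:

```java
import java.util.Objects;

import org.apache.ignite.cache.affinity.AffinityKeyMapped;

public class OrderKey {
    private final long orderId;

    // All orders of the same customer are mapped to the same partition,
    // and therefore to the same node.
    @AffinityKeyMapped
    private final long customerId;

    public OrderKey(long orderId, long customerId) {
        this.orderId = orderId;
        this.customerId = customerId;
    }

    // Cache keys must define equality.
    @Override public boolean equals(Object o) {
        if (!(o instanceof OrderKey)) return false;
        OrderKey k = (OrderKey)o;
        return orderId == k.orderId && customerId == k.customerId;
    }

    @Override public int hashCode() {
        return Objects.hash(orderId, customerId);
    }
}
```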
And finally, can you give some more details about your use case? I will be happy to give some advice on how to approach it.

Related

Best way to construct a cache key whose uniqueness is defined by 6 properties

I am currently tasked with fixing the cache for an ecommerce-like system whose prices depend on many factors. The cache backend is redis. For a given product, the factors that influence the price are:
sku
channel
sub channel
plan
date
Currently the cache is structured like this in redis:
product1_channel1_subchannel1: {sku_1: {plan1: {2019-03-18: 2000}}}
The API caters to requests for multiple products, skus, and all the factors above. So they decided to query all the data at the product_channel_subchannel level and filter it in the app, which is very slow. They have also decided that, on a cache miss, they will construct the cache for all skus for 90 days of data. This way only one request faces the wrath while the others benefit from it (the catch is that we now bust the cache more often, which also drags the system down).
The downside of including all these factors in the keys is that there will be too many keys. To ballpark it: there are 400 products, each made up of 20 skus, with 20 channels, 200 subchannels, 3 types of plans, and 400 days of pricing. To avoid that many keys, the data must be grouped somewhere.
The system currently receives about 10 rps and has to respond within 100 ms.
My questions are:
Is the above cache structure fine? Or how do we go about flattening this structure?
How are caches structured in pricing systems in general? I feel like this is a very trivial task; nonetheless I find it very hard to justify my approaches.
Is it okay to sacrifice one request to warm the cache for the bulk of the data? Or is it better to have a cache warming strategy?
Any sort of caching strategy will be an exercise in trade-offs. And the precise trade-offs you need to make will be dependent upon complex domain logic that you can't predict until you try it out.
What this means is that whatever you implement should be based on data and should be flexible enough to change over time as the business changes. In particular, the answers to these questions:
Is it okay to sacrifice one request to warm the cache for the bulk of the data? Or is it better to have a cache warming strategy?
depend on how the data will be queried by your users and how long a cache miss will take. If queries tend to be clustered around certain skus, or certain dates in a predictable manner, then you should use that information to help guide cache hits and misses.
There is no way I, or anyone else, can give you a correct answer without doing proper experimentation, but we can give you some guidelines.
Here are some best practices that I would recommend when using redis for caching:
If the bottleneck is sending data from redis to the API, then consider using Lua scripts to do the simple processing before any data leaves redis (a sketch follows at the end of this answer). But be careful that you don't make the scripts too complex, since a long-running Lua script can block all other parts of redis.
It looks like you are using simple get/set keys to store your data. Consider using something more complex (a sketch follows after this list):
a. use sorted sets (zsets) if you want better access to data by date (use the date as the score).
b. use hash sets to get more fine-grained access to skus.
Based on your question, it looks like you will have about 1.6M keys (400 products x 20 channels x 200 subchannels). This is not a huge amount, but you need to make sure that redis has enough memory to store everything in RAM without swapping anything to disk. This is something that we had to learn the hard way. If you are running your redis instance on Linux, you must set the system's swappiness to 0 to ensure swap is never used.
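Here is a minimal sketch of the sorted-set (a) and hash (b) ideas above, using the Jedis client. The key layout and values are illustrative, not a recommendation:

```java
import java.time.LocalDate;

import redis.clients.jedis.Jedis;

public class PriceCacheSketch {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // (a) One sorted set per product/channel/subchannel/sku/plan,
            //     with the date (as epoch day) as the score.
            String zkey = "price:product1:channel1:subchannel1:sku1:plan1";
            long day = LocalDate.of(2019, 3, 18).toEpochDay();
            jedis.zadd(zkey, day, day + ":2000"); // embed the date so members stay unique

            // Range query: all prices between two dates, no app-side filtering.
            long from = LocalDate.of(2019, 3, 1).toEpochDay();
            long to = LocalDate.of(2019, 3, 31).toEpochDay();
            for (String entry : jedis.zrangeByScore(zkey, from, to))
                System.out.println(entry);

            // (b) One hash per product/channel/subchannel, one field per
            //     sku/plan/date, for fine-grained single-lookup access.
            String hkey = "price:product1:channel1:subchannel1";
            jedis.hset(hkey, "sku1:plan1:2019-03-18", "2000");
            System.out.println(jedis.hget(hkey, "sku1:plan1:2019-03-18"));
        }
    }
}
```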
But, most importantly, you need to experiment with everything until you find a good solution.
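And a sketch of the Lua-script idea from the first point: filter hash fields by sku prefix inside redis, so only the matching entries cross the network. The script and key names are illustrative:

```java
import java.util.Collections;

import redis.clients.jedis.Jedis;

public class LuaFilterSketch {
    // Returns only the fields of a hash whose name starts with ARGV[1].
    private static final String SCRIPT =
        "local all = redis.call('HGETALL', KEYS[1]) " +
        "local out = {} " +
        "for i = 1, #all, 2 do " +
        "  if string.sub(all[i], 1, string.len(ARGV[1])) == ARGV[1] then " +
        "    out[#out + 1] = all[i] " +
        "    out[#out + 1] = all[i + 1] " +
        "  end " +
        "end " +
        "return out";

    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            jedis.hset("price:product1:channel1:subchannel1", "sku1:plan1:2019-03-18", "2000");
            jedis.hset("price:product1:channel1:subchannel1", "sku2:plan1:2019-03-18", "2400");
            Object matches = jedis.eval(SCRIPT,
                Collections.singletonList("price:product1:channel1:subchannel1"),
                Collections.singletonList("sku1:"));
            System.out.println(matches); // only the sku1 entries
        }
    }
}
```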

Which caching mechanism to use in my spring application in below scenarios

We are using a Spring Boot application with a Maria DB database. We get data from different services and store it in our database, and while calling another service we need to fetch data from the DB (based on a mapping) and call that service.
So, to avoid the database hit, we want to cache all mapping data and use the cache when retrieving data and calling the service API.
Our ask is: add data to the cache when it gets created in the database (this could add up to millions of records), and remove it from the cache when the value of a certain status column becomes "xyz" (for example), or based on an eviction policy.
Should we use an in-memory cache such as Hazelcast/Ehcache, or Redis/Couchbase?
Please suggest.
Thanks
I mostly agree with Rick in terms of "don't build it until you need it"; however, it is important these days to think early about where this caching layer would fit later and how to integrate it (for example using interfaces). Adding it to an unprepared system is always possible, but much more expensive (in terms of hours) and complicated.
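Since the app is Spring Boot, one cheap way to prepare that integration point is Spring's cache abstraction: code against the annotations now, and swap the provider (Hazelcast, Ehcache, Redis, ...) later without touching call sites. A rough sketch; MappingService, the "mappings" cache name, and the in-memory stand-in for Maria DB are all illustrative, and @EnableCaching must be present on a configuration class:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.springframework.cache.annotation.CacheEvict;
import org.springframework.cache.annotation.Cacheable;
import org.springframework.stereotype.Service;

@Service
public class MappingService {
    // Stand-in for the Maria DB repository in this sketch.
    private final Map<String, String> db = new ConcurrentHashMap<>();

    // Read through the cache; the database is hit only on a miss.
    @Cacheable(cacheNames = "mappings", key = "#id")
    public String findMapping(String id) {
        return db.get(id);
    }

    // Call this when the status column flips to "xyz"; the provider's own
    // eviction policy (TTL, max size) covers the rest.
    @CacheEvict(cacheNames = "mappings", key = "#id")
    public void invalidateMapping(String id) {
        // The annotation performs the cache removal; nothing else needed here.
    }
}
```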
OK, to the actual question; disclaimer: Hazelcast employee.
In general, for caching, Hazelcast, Ehcache, Redis and others are all good candidates. The first question you want to ask yourself, though, is: "Can I hold all necessary records in the memory of a single machine?" Especially with Ehcache you get replication (all machines hold all information), which means every single node needs to keep the full data set in memory. Depending on the size you want to cache, that may not be optimal. In this case Hazelcast might be the better option, as we partition data in a cluster and optimize access down to a single network hop, with minimal overhead beyond network latency.
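For illustration, a minimal embedded-Hazelcast sketch of such a partitioned map (Hazelcast 4.x packages assumed; the map name is made up). Entries are spread across whatever members join the cluster, so no single node has to hold everything:

```java
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.map.IMap;

public class PartitionedMapSketch {
    public static void main(String[] args) {
        // Starts (or joins) an embedded cluster member.
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();

        // The map's entries are partitioned across all cluster members;
        // each get/put reaches the owning member in a single network hop.
        IMap<String, String> mappings = hz.getMap("service-mappings");
        mappings.put("record-1", "mapped-value");
        System.out.println(mappings.get("record-1"));

        hz.shutdown();
    }
}
```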
Second question would be around serialization. Do you want to store information in a highly optimized serialization format (which needs code to transform it into something human-readable), or do you want to store it as JSON?
Third question is about the number of clients and threads that will access the data store. Obviously a local cache like Ehcache is always the fastest option, at the cost of lots and lots of memory. Apart from that, the most important factor is the threading model the in-memory store uses. It's either multithreaded and nicely scaling, or a single-threaded design which becomes a bottleneck once you exhaust that thread. You can work around that with more processes, but it remains a workaround for utilizing today's systems to the fullest.
In more general terms, each of the systems you mention would do the job. The best tool, however, should be selected by a POC/prototype against your real-world use case. The important bit is real world, as a single thread behaves amazingly under low pressure (obviously way faster) but, when exhausted, becomes a major bottleneck (again, obviously delaying responses).
I hope this helps a bit since, at least to me, every answer like "yes we are the best option" would be an immediate no-go for the person who said it.
Build InnoDB with the memcached Plugin
https://dev.mysql.com/doc/refman/5.7/en/innodb-memcached.html

Infinispan vs memcached for high concurrency need

My web application maintains an in-memory cache of domain entities which are read/written at high frequency. To make the application clustered, I need to synchronize/externalize this cache.
Which will be the better option between memcached and Infinispan, considering the following application facts:
the cache will be read/written at high frequency, many times per second
with Infinispan, data needs to be replicated across nodes in near-real time
highly concurrent writes should not create conflict issues if replication is slow
I feel memcached will serve this purpose well, since it's centralized and doesn't suffer the replication delay of Infinispan. Can experts provide an opinion on this?
Unfortunately I'm not a Memcached expert but let me tell you more about some fundamental concepts so that you could pick the best option for your use case...
First, centralized vs decentralized: if you have only one node in your system, it will be faster (as you said, there is no replication). However, what will happen if the node goes down? Or another scenario: what will happen if the node gets full (as you said, you will perform a lot of reads/writes per second)? One solution is master/slave replication, where writes are propagated to the slave node asynchronously. This will save you if the node goes down, but won't do any good if the node is full (if the master node is full, the slave will get full a couple of minutes later).
Data consistency: if you have more than one node in your system, your data might get out of sync. Imagine asynchronous replication between two nodes and a client connected to each of them. Both clients perform a write to the same key at the same exact moment. It might seem unlikely, but believe me, with highly concurrent reads and writes it will happen. The only way to solve this problem is synchronous replication with a majority of nodes up and running (so-called consensus).
Back to your scenario: if a broken node is not a problem for you (for example, you can switch to some other data source automatically) and your data won't grow, go with a one-node solution or master/slave replication. If your data needs to be strongly consistent, make sure you're doing sync replication (possibly with transactions, but refer to the user manual for guidance). Otherwise I would recommend picking a more versatile solution which will allow you to add/remove nodes without taking the whole system down and will have an option for sync/async replication.
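To make the sync/async distinction concrete, here is a sketch with embedded Infinispan (the cache name and keys are made up; clustering uses the default JGroups stack, and transactional setups need extra configuration per the user manual):

```java
import org.infinispan.Cache;
import org.infinispan.configuration.cache.CacheMode;
import org.infinispan.configuration.cache.ConfigurationBuilder;
import org.infinispan.configuration.global.GlobalConfigurationBuilder;
import org.infinispan.manager.DefaultCacheManager;

public class SyncReplSketch {
    public static void main(String[] args) {
        DefaultCacheManager mgr = new DefaultCacheManager(
            GlobalConfigurationBuilder.defaultClusteredBuilder().build());

        // REPL_SYNC: a write blocks until all nodes acknowledge it, so reads
        // anywhere are consistent, at the cost of write latency. Swap in
        // CacheMode.DIST_ASYNC for partitioned, asynchronous replication.
        mgr.defineConfiguration("entities", new ConfigurationBuilder()
            .clustering().cacheMode(CacheMode.REPL_SYNC)
            .build());

        Cache<String, String> cache = mgr.getCache("entities");
        cache.put("entity-1", "state");
        System.out.println(cache.get("entity-1"));

        mgr.stop();
    }
}
```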
From my experience, people care too much about data consistency when they should care much more about scalability. And a final piece of advice: please define your performance criteria before evaluating any solution (something like "my writes need to take no longer than X and reads no longer than Y"). Also define a confidence level for your criteria ("I need 99.5% of all reads to be faster than X").

Caching In Laravel5.2

I want to use Redis as the cache in my project. As we know, Redis stores data in memory, and there are of course limits on that. How long will the data persist in memory? Do I need to implement some algorithm for that (least recently used, for example)?
There is no need to implement such algorithms explicitly. Redis comes with built-in eviction policies; you can configure one of them: http://redis.io/topics/lru-cache
Redis supports expiring keys after a certain time range. Suppose you need the cache only for 4 hours; you can do this with EXPIRE: http://redis.io/commands/expire
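For example, a per-key TTL via the Jedis client (the key name is illustrative; the eviction policy from the previous point is configured server-side in redis.conf, e.g. maxmemory-policy allkeys-lru):

```java
import redis.clients.jedis.Jedis;

public class ExpireSketch {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            jedis.set("cache:report:42", "payload");
            jedis.expire("cache:report:42", 4 * 60 * 60); // gone after 4 hours
            System.out.println(jedis.ttl("cache:report:42")); // seconds remaining
        }
    }
}
```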
Redis uses special compact encodings for aggregate data within a configured size range. You can structure all your hashes and sorted sets in such a way that they hold a lot of data in less memory: http://redis.io/topics/memory-optimization
Go through all these docs and you will get a better idea of how to implement this. Hope this helps.

How to distribute data and computation to maximize locality?

Please bear with me, this is a basic architectural question for my first attempt at a "big data" project, but I believe your answers will be of general interest to anyone who is starting out in this field.
I've googled and read the high-level descriptions of Kafka, Storm, Memcached, MongoDB, etc., but now that I'm ready to dig in to start designing my app, I still need some further insight on how in fact the data should be distributed and shared.
The performance of my app is critical, so one objective is to somehow maximize the locality of the data in the RAM of the machines doing the distributed calculations. I need advice for this part of the design.
If my app had some clear criteria for a priori sharding the data and distributing the calculations (such as geographical regions or company divisions) then the solution would be obvious. But unfortunately my app's data access patterns are dynamic and depend on the results of previous calculations.
My app is an analysis program with distinct stages. In the first stage, all the data is accessed once and a metric is calculated for each data object. In the second stage, a subset of the data objects may be accessed, with the probability of access being proportional to each data object's metric that was calculated in the previous stage. In the final stage, a relatively small subset of data objects will be accessed many times for many calculations.
At all stages, it is required that the calculations be distributed across several servers. The calculations are embarrassingly parallel, and each distributed calculation only needs to access a few data objects. It is also required that the number of servers can be specified before the app runs (for example, run on one server, or run on fifty servers).
It seems to me that I need some mechanism that distributes the appropriate data objects to the appropriate compute servers, as opposed to just blindly fetching the data from some database service (whether centralized or distributed). Also, it seems to me that some sort of smart caching system might be appropriate, since the data access pattern depends on the previous calculations and cannot be predicted a priori. But as far as I can tell, Memcached is not such a system because the sharding is determined a priori.
I've read many times that the operating system cache performs better than any monkeying around that we may try. I think the ideal solution is that each compute server's RAM cache somehow captures the data objects' dynamic access patterns, but it's not clear to me how this would work with a NoSQL or Memcached service.
Thanks for bearing with me this far. I realize this is a basic question, but the answer eludes me so far. I can't resolve the dynamic access patterns of my app with the a priori sharding of the NoSQL/Memcached packages. Any advice would be greatly appreciated.
I recommend taking a look at http://tarantool.org. Shard to maximize locality for the most common data access pattern, use Lua for local computations, and net.box to issue a remote RPC when a calculation needs to continue on another node. All data is stored in RAM, and if you write your computation code carefully it can take advantage of the just-in-time compiler.
