read redis when redis mass insertion - performance

I have a server running all day using redis as the data storage. There is one huge data updating (nearly 10 million rows) on a specific time (3 am) every day and a lot of real time but few data (nearly 100 rows) updating at other time.
I choose the redis mass insertion mode to accelerate the data insertion, which costs 30 seconds. But at that time, the redis query performance is really bad. Any ideas to avoid this problem?
If I use redis master-slave mode to separate the read and write, and the master writes as well as the slave reads. But when the batch insertion happens on the master, and there are also a lot of data needed to be synced to slave, and I doubt that it is still the hot spot on the slave redis for querying.
Any suggestion on this kind of senario?
Thank you.

First of all, I'd investigate where the bottleneck is. Is it network I/O? You might want to setup a dedicated route with (virtual) nic's, and even a dedicated internet connection to address that. Is it CPU? You might want to spread out the mass insertion.
If you use simple non-transactional pipelining, it should give redis time to breathe, so clients won't notice the mass insertion.
An alternative could be using client-side connections. Let the client connect to a different slave, which you promote to master temporarily. After the mass insertion is complete, you let the client connect to the 'real' slave again. You might be able to use redis sentinel for this, but with redis pub/sub you can accomplish the same result but with more control. Use a separate small redis instance with redis connection hosts/ports stored in a HSET.
Hope this helps, TW

Related

Does redis itself have cache?

What's the difference between query for a specific key once and many times? does it cost the same time every time, or 2nd will be faster than 1st? Hope someone can give a hand:) I've checked the official site and found nothing on this.
Redis is an in-memory database, which means all data it manages is stored in RAM. That being the case, there is no need/way for Redis to cache the data and every request is served in constant and predictable latency.

Azure Redis cache latency

I am working on an application having web job and azure function app. Web job generates the redis cache for function app to consume. Cache size is around 10 Mega Bytes. I am using lazy loading and all as per the recommendation. I still find that the overall cache operation is slow. Depending upon the size of the file i am processing, i may end up calling Redis cache upto 100,000 times . Wondering if I need to hold the cache data in a local variabke instead of reading it every time from redis. Has anyone experienced any latency in accessing Redis? Does it makes sense to create a singletone object in c# function app and refresh it based on some timer or other logic?
could you consider this points in your usage this is some good practices of azure redis cashe
Redis works best with smaller values, so consider chopping up bigger data into multiple keys. In this Redis discussion, 100kb is considered "large". Read this article for an example problem that can be caused by large values.
Use Standard or Premium Tier for Production systems. The Basic Tier is a single node system with no data replication and no SLA. Also, use at least a C1 cache. C0 caches are really meant for simple dev/test scenarios since they have a shared CPU core, very little memory, are prone to "noisy neighbor", etc.
Remember that Redis is an In-Memory data store. so that you are aware of scenarios where data loss can occur.
Reuse connections - Creating new connections is expensive and increases latency, so reuse connections as much as possible. If you choose to create new connections, make sure to close the old connections before you release them (even in managed memory languages like .NET or Java).
Locate your cache instance and your application in the same region. Connecting to a cache in a different region can significantly increase latency and reduce reliability. Connecting from outside of Azure is supported, but not recommended especially when using Redis as a cache (as opposed to a key/value store where latency may not be the primary concern).
Redis works best with smaller values, so consider chopping up bigger data into multiple keys.
Configure your maxmemory-reserved setting to improve system responsiveness under memory pressure conditions, especially for write-heavy workloads or if you are storing larger values (100KB or more) in Redis. I would recommend starting with 10% of the size of your cache, then increase if you have write-heavy loads. See some considerations when selecting a value.
Avoid Expensive Commands - Some redis operations, like the "KEYS" command, are VERY expensive and should be avoided.
Configure your client library to use a "connect timeout" of at least 10 to 15 seconds, giving the system time to connect even under higher CPU conditions. If your client or server tend to be under high load, use an even larger value. If you use a large number of connections in a single application, consider adding some type of staggered reconnect logic to prevent a flood of connections hitting the server at the same time.

Infinispan vs memcached for high concurrency need

My web application maintains in memory cache of domain entities which are read/written at high frequency. To make application clustered, i need to synchronize / externalize this cache.
Which will be better option amongst memcached and infinispan considering following application facts-
cache will be read/written at high frequency per second
if infinispan, data need to replicated across nodes near- real time
high concurrent write should not create conflicts issue if replication is slow.
I feel memcached will solve this purpose well since it's centralized and does not need replication delay like infinispan. Can experts provide opinion on this?
Unfortunately I'm not a Memcached expert but let me tell you more about some fundamental concepts so that you could pick the best option for your use case...
First, centralized vs decentralized - if you have only one node in your system, it will be faster (as you said there is no replication). However what will happen if the node is down? Or another scenario - what will happen if the node gets full (as you said you will perform a lot of read/writes per second)? One solution for that is to use master/slave replication where writes are propagated to the slave node asynchronously. This solution will save you in case the node is down but won't do any good if the node is full (if master node is full, slave will get full a couple of minutes later).
Data consistency - if you have more than 1 node in your system, your data might get out of sync. Imagine asynchronous replication between 2 nodes and a client connected to each of them. Both clients perform a write to the same key at the same exact moment. It might seems unlikely but believe me, with highly concurrent reads and writes it will happen. The only way to solve this problem is to use synchronous replication with majority of nodes up and running (or with so called consensus).
Back to your scenario - if a broken node is not a problem for you (for example, you can switch to some other data source automatically) and your data won't grow - go ahead for 1 node solution or master/slave replication. If your data need to be strongly consistent - make sure you're doing sync replication (and possibly with transactions but you need to refer to the user manual for guidance). Otherwise I would recommend picking a more versatile solution which will allow you to add/remove nodes without taking down whole system and will have an option for sync/async replication.
From my experience, people care too much about data consistency whereas should care much more about scalability. And a final piece of advice - please define your performance criteria before evaluating any solution (something like, my writes need to take no longer than X and reads no longer than Y. Define also confidence level for your criteria (I need 99.5% of all reads to be less than X).

Balancing Redis queries and in-process memory?

I am a software developer but wannabe architect new to the server scalability world.
In the context of multiple services working with the same data set, aiming to scale for redundancies and load balancing.
The question is: In a idealistic system, should services try to optimize their internal processing to reduce the amount of queries done to the remote server cache for better performance and less bandwidth at the cost of some local memory and code base or is it better to just go all-in and query the remote cache as the single transaction point every time any transaction need processing done on the data?
When I read about Redis and even general database usage online, the later seems to be the common option. Every nodes of the scaled application have no memory and read and write directly to the remote cache on every transactions.
But as a developer, I ask if this isn't a tremendous waste of resources? Whether you are designing at electronic chips level, at inter-thread, inter-process or inter-machine, I do believe it's the responsibility of each sub-system to do whatever it can to optimize its processing without depending on the external world if it can and hence reduce overall operation time.
I mean, if the same data is read over hundreds or time from the same service without changes (write), isn't it just more logical to keep a local cache and wait for notifications of changes (pub/sub) and only read only these changes to update the cache instead reading the bigger portion of data every time a transaction require it? On the other hand, I understand that this method implies that the same data will be duplicated at multiple place (more ram usage) and require some sort of expiration system not to keep the cache from filling up.
I know Redis is built to be fast. But however fast it is, in my opinion there's still a massive difference between reading directly from local memory versus querying an external service, transfer data over network, allocating memory, deserialize into proper objects and garbage collect it when you are finished with it. Anyone have benchmark numbers between in-process dictionaries query versus a Redis query on the localhost? Is it a negligible time in the bigger scheme of things or is it an important factor?
Now, I believe the real answer to my question until now is "it depends on your usage scenario", so let's elaborate:
Some of our services trigger actions on conditions of data change, others periodically crunch data, others periodically read new data from external network source and finally others are responsible to present data to users and let them trigger some actions and bring in new data. So it's a bit more complex than a single web pages deserving service. We already have a cache system codebase in most services, and we have a message broker system to notify data changes and trigger actions. Currently only one service of each type exist (not scaled). They transfer small volatile data over messages and bigger more persistent (changing less often) data over SQL. We are in process of moving pretty much all data to Redis to ease scalability and performances. Now some colleagues are having a heated discussion about whether we should abandon the cache system altogether and use Redis as the common global cache, or keep our notification/refresh system. We were wondering what the external world think about it. Thanks
(damn that's a lot of text)
I would favor utilizing in-process memory as much as possible. Any remote query introduces latency. You can use a hybrid approach and utilize in-process cache for speed (and it is MUCH faster) but put a significantly shorter TTL on it, and then once expired, reach further back to Redis.

Load balance/distribution for postgresql

I am coming here after spending considerable time trying to understand how to implement load balancing (distributing database processing load) between postgresql database servers.
I have a postgresql system which attracts about 100s of transactions per second and this is likely to grow. Please do note that my case has so many updates + inserts + selects as well. So any solution for me needs to cater to all insert/update and reads.
I am planning to use plproxy as suggested through db tools from skype at http://www.slideshare.net/adorepump/database-tools-by-skype.
Now I am also hearing that "postgresql streaming replication + hot standby" in postgres 9.0 can be considered
Can someone suggest me if there is any simple (or complex) solution to implement for the above scenario?
If your database is smaller than 100GB then you should first try to maximize what you can from one computer.
You'd need:
a good storage controller with large battery backed cache;
a bunch of fast disks in RAID10;
another bunch of disks in RAID10 for WAL;
more RAM than you have data;
as many fast processor cores as you can.
You'd be able to do several 1000s of tps with this one computer.
If it won't be enough I'd try to add a second hot standby server with streaming replication. You'd use it to run long running read-only report queries, backups etc. so your master server won't have to do these.
Only if it prove not enough then you should try to add more streaming replication hot standby servers to load balance read-only queries. This will be complicated though - because it is asynchronous there's delay between master confirming and stand-by seeing a change. You'd have to deal with it in your client application. Your setup will be a lot more complicated.

Resources