Keep in-memory data structures of two processes in sync - algorithm

How do I keep in-memory data structures in sync between two processes? Both processes run the same server program: one is active and the other is a standby. The standby needs to take over if the active one crashes or otherwise fails, and for that to work its in-memory data structures must be kept in sync with the active's. Can I use Virtual Synchrony? Will it help? If so, is there a library I can use? I am coding in C++ on Windows (Visual Studio).
If that is not a solution, what is a good approach I can refer to?
TIA

The easiest solution to implement is to store the state in a separate database, so that when you fail over, the standby simply continues using the same database. If you are worried about the database crashing, accept the cost and complexity of running main and standby databases, with the database also failing over. This is an attempt to push the complexity of handling state across failovers onto the database. Of course, you may find that the overhead of database transactions becomes a bottleneck. It might be tempting to go NoSQL for this, but remember that you are probably relying on the ACID guarantees you get with a traditional database. If you ditch these, typically getting eventual consistency in return, you will have to think about what this means on failover. Will you lose a small amount of recent information on failover? Do you care?
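As a rough illustration of pushing state out of the server process, here is a minimal sketch. `StateStore` is a hypothetical in-memory stand-in for the external database (a real system would talk to a transactional database client instead), and `Server`, `handle_update` and `handle_query` are made-up names. The point is only that the server object keeps no state of its own, so a standby constructed on the same store sees everything the active wrote.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for the external database. In a real deployment this
// would be a client for a transactional database, not an in-process map.
class StateStore {
public:
    void put(const std::string& key, const std::string& value) { data_[key] = value; }

    // Returns the stored value, or an empty string if the key is unknown.
    std::string get(const std::string& key) const {
        auto it = data_.find(key);
        return it == data_.end() ? std::string() : it->second;
    }

private:
    std::map<std::string, std::string> data_;
};

// The server holds no state itself; every read and write goes to the store.
class Server {
public:
    explicit Server(std::shared_ptr<StateStore> store) : store_(std::move(store)) {
        assert(store_ != nullptr);
    }

    void handle_update(const std::string& key, const std::string& value) {
        store_->put(key, value);
    }

    std::string handle_query(const std::string& key) const { return store_->get(key); }

private:
    std::shared_ptr<StateStore> store_;
};
```

With this shape, "failover" amounts to constructing a second `Server` on the same store after the first one is gone. Whether recent updates survive then depends entirely on the guarantees of the real store behind the interface.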
Virtual synchrony looks interesting. I have searched for similar things and found academic pages like http://www.cs.cornell.edu/ken/, some of which, like this one, link to open source software produced by research groups. I have never used any of it. I seem to remember reports that these systems worked pretty well for a small number of machines with very good connectivity but hit performance problems at scale, which I presume won't be a problem for you.
Once upon a time people built multiprocess systems on Unix machines by having the processes communicate via shared memory, or memory mapped files. For very simple data structures, this can be made to work. One problem you have is if one of the processes crashes halfway through modifying the shared data - will this mess up the other processes? You can solve these problems, but you are in danger of discovering that you have implemented everything inside the database that I described in my first paragraph.

You could use an in-memory store such as memcached or Redis.

Related

Which caching mechanism to use in my spring application in below scenarios

We are using a Spring Boot application with a MariaDB database. We get data from different services and store it in our database. When calling another service, we need to fetch data from the DB (based on a mapping) and then call the service.
To avoid hitting the database, we want to keep all mapping data in a cache and use it to retrieve data and call the service API.
So our ask is: add data to the cache when it is created in the database (this could add up to millions of records), and remove it from the cache when the value of a status column is "xyz" (for example) or based on an eviction policy.
Should we use an in-memory cache such as Hazelcast/ehCache, or Redis/Couchbase?
Please suggest.
Thanks
I mostly agree with Rick in terms of don't build it until you need it, however it is important these days to think early of where this caching layer would fit later and how to integrate it (for example using interfaces). Adding it into a non-prepared system is always possible but much more expensive (in terms of hours) and complicated.
Ok to the actual question; disclaimer: Hazelcast employee
In general, Hazelcast, ehcache, Redis and others are all good candidates for caching. The first question to ask yourself, though, is: "Can I hold all the necessary records in the memory of a single machine?" With ehcache in particular you get replication (all machines hold all information), which means every single node needs to keep everything in memory. Depending on how much you want to cache, that may not be optimal. In this case Hazelcast might be the better option, as we partition data across the cluster and optimize access to a single network hop, with minimal overhead on top of network latency.
Second question would be around serialization. Do you want to store information in a highly optimized serialization (which needs code to transform to human readable) or do you want to store as JSON?
Third question is about the number of clients and threads that will access the data store. Obviously a local cache like ehcache is always the fastest option, at the cost of lots and lots of memory. Apart from that, the most important factor is the threading model the in-memory store uses. It is either multithreaded and scales nicely, or single-threaded, in which case that one thread becomes a bottleneck once you exhaust it. You can overcome this by running more processes, but that is a workaround for utilizing today's systems to the fullest.
In more general terms, each of the systems you mentioned would do the job. The best tool, however, should be selected with a POC / prototype against your real-world use case. The important bit is "real world": a single thread behaves amazingly under low pressure (obviously way faster) but becomes a major bottleneck once exhausted (again, obviously delaying responses).
I hope this helps a bit since, at least to me, every answer like "yes we are the best option" would be an immediate no-go for the person who said it.
Build InnoDB with the memcached Plugin
https://dev.mysql.com/doc/refman/5.7/en/innodb-memcached.html

Why does Hadoop follow WORM( write once read many times) and does not allow updates?

Hadoop follows WORM (write once read many times). Why does it not allow any updates?
thanks
The real question is: what is the motivation for updating data? We store our entities in the database and update them as new information is seen, but why? The reason is that when databases were first being architected, disk space was expensive. Fast-forward to the present day and disk space is cheap, which means that we can afford to record changes to data as new entries, like a log of the changes that the entities go through in their lifespan.
By using this approach, the lineage of the data is more apparent - we simply revisit older versions of the same entity to discover where it has come from and what transformations have been applied to it. Moreover, if something were to happen to the latest version, all is not lost. We simply drop back to an older version and state loss is minimal. This is obviously preferable to updated entities, in which entire entities can be lost and potentially never recovered.
This is documented very well in Nathan Marz and James Warren's 'Big Data - Principles and Practices of Scalable Real-time Data Systems'.
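To make the "changes as new entries" idea concrete, here is a small in-memory sketch. It is not how HDFS or any Hadoop component is implemented; `VersionedStore` and its method names are invented for illustration. Every write appends a new version, nothing is overwritten, and older versions stay readable so you can drop back to them.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Illustrative append-only store: each entity id maps to its full history.
class VersionedStore {
public:
    // Records a new version; earlier versions are never modified.
    void append(const std::string& id, const std::string& value) {
        log_[id].push_back(value);
    }

    // Number of versions recorded for an entity (0 if the id is unknown).
    std::size_t version_count(const std::string& id) const {
        auto it = log_.find(id);
        return it == log_.end() ? 0 : it->second.size();
    }

    // Returns version n (0 = oldest). Precondition: n < version_count(id).
    const std::string& version(const std::string& id, std::size_t n) const {
        auto it = log_.find(id);
        assert(it != log_.end() && n < it->second.size());
        return it->second[n];
    }

    // Returns the newest version. Precondition: version_count(id) > 0.
    const std::string& latest(const std::string& id) const {
        return version(id, version_count(id) - 1);
    }

private:
    std::unordered_map<std::string, std::vector<std::string>> log_;
};
```

The trade-off is the one the answer describes: storage grows with every change, in exchange for lineage and the ability to fall back to an earlier version.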
It was easier. More precisely, for reliable writes in a distributed cluster with complex failure patterns, significantly easier. And, with applications that are written for append-only/log based operations, works well.
You can now append to HDFS (Hadoop 2.6+ recommended), but you can only write exactly at the end of the file; you can't seek() to earlier in the file, or past the current EOF, then write.
Will this ever be fixed? Maybe. But recent work on encryption at rest and erasure coding has focused more on compressing and encrypting the existing data, which could potentially make seek+write even harder. I'd recommend not waiting for this feature, but writing code which works within the constraints (as HBase and Accumulo do).

Balancing Redis queries and in-process memory?

I am a software developer but wannabe architect new to the server scalability world.
The context is multiple services working with the same data set, which we want to scale for redundancy and load balancing.
The question is: in an ideal system, should services optimize their internal processing to reduce the number of queries made to the remote cache server, gaining performance and saving bandwidth at the cost of some local memory and extra code? Or is it better to go all-in and query the remote cache as the single transaction point every time a transaction needs to process the data?
When I read about Redis, and even about general database usage, online, the latter seems to be the common option. Every node of the scaled application holds nothing in memory and reads and writes directly to the remote cache on every transaction.
But as a developer, I ask whether this isn't a tremendous waste of resources. Whether you are designing at the level of electronic chips, threads, processes or machines, I believe it is the responsibility of each sub-system to do whatever it can to optimize its processing without depending on the external world, and hence reduce overall operation time.
I mean, if the same data is read hundreds of times by the same service without changes (writes), isn't it more logical to keep a local cache, wait for change notifications (pub/sub), and read only those changes to update the cache, instead of reading the bigger portion of data every time a transaction requires it? On the other hand, I understand that this method means the same data will be duplicated in multiple places (more RAM usage) and requires some sort of expiration system to keep the cache from filling up.
I know Redis is built to be fast. But however fast it is, in my opinion there is still a massive difference between reading directly from local memory and querying an external service, transferring data over the network, allocating memory, deserializing into proper objects and garbage collecting them when you are finished. Does anyone have benchmark numbers comparing an in-process dictionary lookup with a Redis query on localhost? Is the difference negligible in the bigger scheme of things, or is it an important factor?
Now, I believe the real answer to my question until now is "it depends on your usage scenario", so let's elaborate:
Some of our services trigger actions when data changes, others periodically crunch data, others periodically read new data from external network sources, and finally others are responsible for presenting data to users and letting them trigger actions and bring in new data. So it's a bit more complex than a single service serving web pages. We already have a cache system codebase in most services, and we have a message broker system to notify data changes and trigger actions. Currently only one service of each type exists (not scaled). They transfer small volatile data over messages and bigger, more persistent (less frequently changing) data over SQL. We are in the process of moving pretty much all data to Redis to improve scalability and performance. Now some colleagues are having a heated discussion about whether we should abandon the cache system altogether and use Redis as the common global cache, or keep our notification/refresh system. We were wondering what the external world thinks about it. Thanks
(damn that's a lot of text)
I would favor utilizing in-process memory as much as possible. Any remote query introduces latency. You can use a hybrid approach and utilize in-process cache for speed (and it is MUCH faster) but put a significantly shorter TTL on it, and then once expired, reach further back to Redis.
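A minimal sketch of that hybrid approach might look like the following. The remote lookup is passed in as a callback, so nothing here depends on a particular Redis client; `TwoTierCache` and `RemoteFetch` are illustrative names, and the sketch is single-threaded and never evicts expired entries, both of which a real service would need to address.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// In-process cache with a short TTL in front of a remote cache.
class TwoTierCache {
public:
    using Clock = std::chrono::steady_clock;
    // Hypothetical hook: in practice this would wrap a Redis GET.
    using RemoteFetch = std::function<std::string(const std::string&)>;

    TwoTierCache(RemoteFetch fetch, Clock::duration ttl)
        : fetch_(std::move(fetch)), ttl_(ttl) {
        assert(fetch_);
    }

    std::string get(const std::string& key) {
        const auto now = Clock::now();
        auto it = local_.find(key);
        if (it != local_.end() && now < it->second.expires) {
            return it->second.value;  // fresh local hit, no network round trip
        }
        // Missing or expired: reach back to the remote cache and refresh.
        std::string value = fetch_(key);
        local_[key] = Entry{value, now + ttl_};
        return value;
    }

private:
    struct Entry {
        std::string value;
        Clock::time_point expires;
    };

    RemoteFetch fetch_;
    Clock::duration ttl_;
    std::unordered_map<std::string, Entry> local_;
};
```

The TTL bounds how stale a local copy can get. A pub/sub invalidation scheme like the one described in the question could replace or complement it by erasing entries when a change notification arrives.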

Most efficient way to cache in a fastcgi app

For fun I am writing a FastCGI app. Right now all I do is generate a GUID and display it at the top of the page, then make a DB query based on the URL, which pulls data from one of my existing sites.
I would like to cache everything on the page except for the GUID. What is a good way of doing that? I have heard of Redis but never used it. It appears to be a server, which means it runs in a separate process. Perhaps an in-process solution would be faster? (Unless it's not?)
What is a good solution for page caching? (I'm using C++.)
Your implementation sounds like you need a simple key-value caching mechanism, and you could possibly use a container like std::unordered_map from C++11, or its boost cousin, boost::unordered_map. unordered_map provides a hash table implementation. If you needed even higher performance at some point, you could also look at Boost.Intrusive which provides high performance, standard library-compatible containers.
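For the specific case in the question (cache the page but not the GUID), one possible sketch is to cache the rendered page with a placeholder where the GUID goes and substitute a fresh GUID on every request. The `{{GUID}}` marker, `PageCache`, and the `Renderer` callback (standing in for the DB query plus HTML generation) are all assumptions made for this example.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical marker the rendered page contains where the GUID belongs.
const std::string kGuidMarker = "{{GUID}}";

class PageCache {
public:
    // Stand-in for the expensive part: DB query plus page generation.
    using Renderer = std::function<std::string(const std::string& url)>;

    explicit PageCache(Renderer render) : render_(std::move(render)) {
        assert(render_);
    }

    // Returns the page for url with guid substituted for the marker.
    // The expensive render runs only on the first request for each url.
    std::string page(const std::string& url, const std::string& guid) {
        auto it = cache_.find(url);
        if (it == cache_.end()) {
            it = cache_.emplace(url, render_(url)).first;
        }
        std::string html = it->second;  // copy, so the cached template stays intact
        const auto pos = html.find(kGuidMarker);
        if (pos != std::string::npos) {
            html.replace(pos, kGuidMarker.size(), guid);
        }
        return html;
    }

private:
    Renderer render_;
    std::unordered_map<std::string, std::string> cache_;
};
```

This sketch never expires entries and is not thread-safe, so it only covers the lookup side of the problem.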
If you roll your cache with the suggestions mentioned, a second concern will be expiring cache entries, because of the possibility your cached data will grow stale. I don't know what your data is like, but you can choose to implement a caching strategy like any of these:
after a certain time/number of uses, expire a cached entry
after a certain time/number of uses, expire the entire cache (extreme)
least-recently used - there's a stack overflow question concerning this: LRU cache design
Multithreaded/concurrent access may also be a concern, though as suggested in the link above, a possibility would be to lock the cache on access rather than worry about granular locking.
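As one way to combine the least-recently-used strategy with the "lock the whole cache" suggestion, here is a small sketch. It is a common textbook layout (a list ordered by recency plus a hash map of iterators into it), not the design from the linked question; `LruCache` and its interface are illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <mutex>
#include <string>
#include <unordered_map>
#include <utility>

// LRU cache guarded by a single mutex (coarse locking, no granular locks).
class LruCache {
public:
    explicit LruCache(std::size_t capacity) : capacity_(capacity) {
        assert(capacity_ > 0);
    }

    void put(const std::string& key, std::string value) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = index_.find(key);
        if (it != index_.end()) {
            // Existing key: update in place and mark as most recently used.
            it->second->second = std::move(value);
            items_.splice(items_.begin(), items_, it->second);
            return;
        }
        if (items_.size() == capacity_) {
            // Full: evict the least recently used entry (back of the list).
            index_.erase(items_.back().first);
            items_.pop_back();
        }
        items_.emplace_front(key, std::move(value));
        index_[key] = items_.begin();
    }

    // On a hit, copies the value into out, marks the entry as most recently
    // used, and returns true. Returns false on a miss.
    bool get(const std::string& key, std::string& out) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = index_.find(key);
        if (it == index_.end()) return false;
        items_.splice(items_.begin(), items_, it->second);
        out = it->second->second;
        return true;
    }

private:
    using Item = std::pair<std::string, std::string>;

    std::size_t capacity_;
    std::mutex mutex_;
    std::list<Item> items_;  // front = most recently used
    std::unordered_map<std::string, std::list<Item>::iterator> index_;
};
```

`std::list::splice` moves a node without invalidating iterators, which is what keeps the map entries valid when an item is promoted to the front.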
Now if you're talking about scaling, and moving up to multiple processes, and distributing server processes across multiple physical machines, the simple in-process caching might not be the way to go anymore (everyone could have different copies of data at any given time, inconsistency of performance if some server has cached data but others don't).
That's where Redis/Memcached/Membase/etc. shine: they are built for scaling and for offloading work from a database. A database plus an in-process cache can beat them on raw performance (there is network latency, after all, and a host of other factors), but when it comes to scaling they are very useful: they take load off the database and can serve requests quickly. They also come with features such as cache expiration (implementations differ between them).
Best of all? They're easy to use and drop in. You don't have to choose Redis/memcached from the outset: caching itself is just an optimization, and you can quickly swap the caching code from, say, an in-memory cache of your own to Redis or something else.
There are still some differences between the caching servers though - membase and memcache distribute their data, while redis has master-slave replication.
For the record: I work in a company where we use memcached servers - we have several of them in the data center with the rest of our servers each having something like 16 GB of RAM allocated completely to cache.
edit:
And for speed comparisons, I'll adapt something from a Herb Sutter presentation I watched long ago:
process in-memory -> really fast
getting data from a local process in-memory data -> still really fast
data from local disk -> depends on your I/O device, SSD can be fast, but mechanical drives are glacial
getting data from remote process (in-memory data) -> fast-ish, and your cache servers better be close
getting data from remote process (disk) -> iceberg

Best practices for deploying a high performance Berkeley DB system

I am looking to use Berkeley DB to create a simple key-value storage system. The keys will be SHA-1 hashes, so they are in 160-bit address space. I have a simple server working, that was easy enough thanks to the fairly well written documentation from Berkeley DB website. However, I have some questions about how best to set up such a system, to get good performance and flexibility. Hopefully, someone has had more experience with Berkeley DB and can help me.
The simplest setup is a single process, with a single thread, handling a single DB; inserts and gets are performed on this one DB, using transactions.
Alternative 1: single process, multiple threads, single DB; inserts and gets are performed on this DB, by all the threads in the process.
Does using multiple threads provide much performance improvements? There is one single DB, and therefore it's on one disk, and therefore I am guessing I won't get too much boost. But if Berkeley DB caches a lot of stuff in memory, then perhaps one thread will be able to run and answer from cache while another has blocked waiting for disk? I am using GNU Pth, user level cooperative threading. I am not familiar with the details of Pth, so I am also not sure if with Pth you can have a userlevel thread run while another userlevel thread has blocked.
Alternative 2: single process, one or multiple threads, multiple DBs where each DB covers a fraction of the 160-bit address space for keys.
I see a few advantages in having multiple DBs: we can put them on different disks, less contention, easier to move/partition DBs onto different physical hosts if we want to do that. Does anyone have experience with this setup and see significant benefits?
Alternative 3: multiple processes, each with one thread, each handles a DB that covers a fraction of the 160-bit address space for keys.
This has the advantages of using multiple DBs, but we are using multiple processes. Is this better than the second alternative? I suspect using processes rather than user-level threads to get parallelism will get you better SMP caching behaviors (less invalidates, etc), but will I get killed with all the process overheads and context switches?
I would love to hear if someone has tried the options, and have seen positive or negative results.
Thanks.
Alternative 2 gives you high scalability. You basically partition your database across multiple servers. If you need a high performance distributed key/value database, I would suggest looking at membase. I am doing that right now but we need to run on an appliance and would like to limit dependencies (of membase).
You can use BerkeleyDB replication and have read only copies with servers to serve read/get requests.
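For the partitioning in Alternative 2, the routing step can be quite small. The sketch below maps a 160-bit SHA-1 key to one of N databases by splitting the key space into N contiguous, equal-width ranges based on the key's top 32 bits. `Sha1Key` and `partition_for` are made-up names, and the sketch says nothing about how each Berkeley DB handle is opened or used; it assumes fewer than 2^32 partitions.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Sha1Key = std::array<std::uint8_t, 20>;  // 160-bit key, big-endian bytes

// Maps a key to a partition index in [0, partitions). Each partition covers
// a contiguous fraction of the key space, decided by the top 32 bits.
std::size_t partition_for(const Sha1Key& key, std::size_t partitions) {
    assert(partitions > 0);
    const std::uint64_t prefix = (std::uint64_t{key[0]} << 24) |
                                 (std::uint64_t{key[1]} << 16) |
                                 (std::uint64_t{key[2]} << 8) |
                                 std::uint64_t{key[3]};
    // prefix lies in [0, 2^32); scale it into [0, partitions).
    return static_cast<std::size_t>((prefix * partitions) >> 32);
}
```

Because SHA-1 output is close to uniformly distributed, equal-width ranges should receive roughly equal numbers of keys. Changing the partition count later moves range boundaries, so some keys would have to be migrated between databases.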
