In a Neo4j Causal Cluster implemented using 3 AWS EC2 instances, should the data be written only to the Leader node, or can it be written to any of the Follower nodes?
Data should always be written to a core server; these writes must be acknowledged by a majority of core servers.
Reads can be served by any of the read replicas.
Using bookmarks, you are able to "read your own writes", ensuring that the data shown is consistent.
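For example, with the Neo4j Java driver (4.x API) the pattern looks roughly like the sketch below; the URI, credentials, and the Person label are placeholders:

```java
import org.neo4j.driver.*;

public class ReadYourOwnWrites {
    public static void main(String[] args) {
        // The neo4j:// routing scheme lets the driver send writes to the
        // leader and route reads to followers / read replicas.
        try (Driver driver = GraphDatabase.driver("neo4j://cluster-host:7687",
                AuthTokens.basic("neo4j", "secret"))) {

            Bookmark bookmark;
            // Write transaction: executed on a core server (the leader).
            try (Session writeSession = driver.session()) {
                writeSession.writeTransaction(tx -> tx.run(
                        "CREATE (:Person {name: $name})",
                        Values.parameters("name", "Alice")).consume());
                bookmark = writeSession.lastBookmark();
            }

            // Read session seeded with the bookmark: the read will not run
            // until the chosen server has caught up with that write.
            try (Session readSession = driver.session(
                    SessionConfig.builder().withBookmarks(bookmark).build())) {
                readSession.readTransaction(tx -> tx.run(
                        "MATCH (p:Person {name: $name}) RETURN p.name",
                        Values.parameters("name", "Alice")).list());
            }
        }
    }
}
```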
I'd like to understand the capabilities of Aeron Cluster with respect to sharing requests across different back-end cluster application instances. I am thinking of something similar to partitions in Kafka, where distinct back-end consumers process the workload in independent processes. There should be a partition key which defines how to find the partition, or it could be a consumer-provided hash, etc.
So far I have only been reading the documentation and the API documents. I also read the Aeron Cookbook article on sharding, but it was not much help: https://aeroncookbook.com/aeron-cluster/on-sharding/
Could someone provide an example of this, if it is possible? The cookbook does not really help much here because it poses a similar problem but with dependencies between the shards.
Aeron Cluster does not directly support sharding. Its primary goal is keeping redundant copies of the same data across multiple nodes. Sharding would need to be layered on via your own application logic. One approach would be to run multiple clusters and use a key to partition data across them; then, within your client application, run multiple cluster clients (one for each cluster) and select the appropriate client based on the data you are interacting with, as sketched below.
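A minimal sketch of that routing idea, assuming one already-connected AeronCluster client per cluster; ShardedClusterClient and the modulo key mapping are illustrative, not part of the Aeron API:

```java
import io.aeron.cluster.client.AeronCluster;
import org.agrona.DirectBuffer;

// Hypothetical router that spreads work across several independent Aeron
// Clusters. Each shard is a complete cluster; the partition key decides
// which cluster a given message is sent to.
public class ShardedClusterClient implements AutoCloseable {
    private final AeronCluster[] shards;

    public ShardedClusterClient(AeronCluster[] shards) {
        // One connected AeronCluster client per cluster; connection setup
        // (ingress channel, egress listener, etc.) is omitted here.
        this.shards = shards;
    }

    // Map a partition key (e.g. an account or order id) to a shard.
    private AeronCluster shardFor(long partitionKey) {
        return shards[(int) Math.floorMod(partitionKey, (long) shards.length)];
    }

    // Offer the message to the cluster that owns this key.
    public long send(long partitionKey, DirectBuffer buffer, int offset, int length) {
        return shardFor(partitionKey).offer(buffer, offset, length);
    }

    @Override
    public void close() {
        for (AeronCluster shard : shards) {
            shard.close();
        }
    }
}
```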
How is data consistency handled in a distributed cache using Oracle Coherence, where each cluster node is responsible only for a piece of the data?
I am also confused about the following:
Are cluster nodes on different servers, each with its own local cache?
For instance, say I have node A with cache "a" and node B with cache "b"; is the database on a separate server D?
When there is an update, is the update first made on D and then written back to caches a and b? How does data consistency work?
An explanation in layman's terms would be helpful, as I am new to Oracle Coherence.
Thank you!
Coherence uses two different distribution mechanisms: full replication and data partitioning; each distributed cache is configured to use one of these. Most caches in most large systems use the partitioned model, because it scales very well, adding storage with each server and maintaining very high performance even up to hundreds of servers.
The Coherence software architecture is service based; when Coherence starts, it first creates a local service for managing clustering, and that service communicates over the network to locate and then join (or create, if it is the first server running) the cluster.
If you have any partitioned caches, then those are managed by partitioned cache service(s). A partitioned cache service coordinates across the cluster to manage the entirety of the partitioned cache. It does this dynamically, starting by dividing the responsibilities of data management evenly across all of the storage-enabled nodes. The data in the cache(s) is partitioned, which means "sliced up", so that some values will go to server 1, some values to server 2, etc. The data ownership model prevents any confusion about who owns what, so even if a message gets delayed on the network and ends up at the wrong server, no damage is done, and the system self-corrects. If a server dies, whatever data (slices) it was managing is backed up by one or more other servers, and the servers work together to ensure that new back-ups are made for any data that does not have the desired number of backups. It is a dynamic system.
There are several different APIs provided to an application, starting with an API as simple as using a hash map (in fact it is the Java Map API).
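As a rough illustration of that Map-style API, here is a minimal sketch; the cache name "accounts" and the key/value types are placeholders, and the cache would have to be defined in the cluster's cache configuration:

```java
import com.tangosol.net.CacheFactory;
import com.tangosol.net.NamedCache;

public class CoherenceMapExample {
    public static void main(String[] args) {
        // Joins (or starts) the cluster and obtains a named cache; for a
        // partitioned cache, Coherence routes each key to the member that
        // owns that key's partition.
        NamedCache<String, Double> accounts = CacheFactory.getCache("accounts");

        // NamedCache extends java.util.Map, so reads and writes look like
        // ordinary map operations.
        accounts.put("alice", 100.0);
        Double balance = accounts.get("alice");
        System.out.println("alice -> " + balance);

        CacheFactory.shutdown();
    }
}
```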
I am new to the big-data tech stack in general. I am implementing a real-time analytics infrastructure that will ingest high-volume/high-velocity data from different services in our microservices backend. The ingested data (and data stream) will be used to populate dashboards for key business metrics, and for BI queries and machine learning.
All of the backend services write the data events into a Kafka cluster that is now in place. I started working on a Spark prototype to read the data from the Kafka cluster and enrich/process it.
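For reference, a minimal sketch of such a prototype (Spark Structured Streaming in Java, which needs the spark-sql-kafka package on the classpath) might look like this; the broker addresses and topic name are placeholders:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class KafkaIngestPrototype {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("kafka-ingest-prototype")
                .getOrCreate();

        // Subscribe to the topic that the backend services write events to.
        Dataset<Row> events = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
                .option("subscribe", "service-events")
                .load();

        // The Kafka source exposes key/value as binary; cast to strings
        // before parsing/enriching.
        Dataset<Row> decoded = events.selectExpr(
                "CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp");

        // The console sink is only for prototyping; a real pipeline would
        // write to durable storage instead.
        decoded.writeStream()
                .format("console")
                .outputMode("append")
                .start()
                .awaitTermination();
    }
}
```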
Now I am working out where to store the data at rest. I know that for real-time analytics, technologies like Vertica and Teradata are fairly popular, but they require a non-trivial capital investment upfront.
So I am trying to stick to open source. After a bit of study, I decided to use HDFS/Impala for the data at rest, running SQL on Hadoop for our real-time BI queries.
I then started wondering whether, instead of HDFS/Impala, it makes more sense to use Cassandra for storing our data at rest. Cassandra scales out and has fast writes and reads. I also read some literature where people gave good arguments for using C* for such use cases.
Any comment/feedback is welcome.
We store petabytes of expiring time-series data in Cassandra, and we're very happy with it. In the ingestion pipeline, we're capable of many millions of writes per second, and reading is fast (sub-millisecond) for displaying/BI. For large ML tasks, you can run Spark on top of Cassandra for analysis.
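A minimal sketch of reading Cassandra data from Spark, assuming the DataStax Spark Cassandra Connector is on the classpath; the contact point, keyspace, and table names are placeholders:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CassandraAnalysisSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("cassandra-analysis")
                // Cassandra contact point used by the connector.
                .config("spark.cassandra.connection.host", "cassandra-node1")
                .getOrCreate();

        // Load a Cassandra table as a DataFrame via the connector's data source.
        Dataset<Row> events = spark.read()
                .format("org.apache.spark.sql.cassandra")
                .option("keyspace", "analytics")
                .option("table", "events_by_day")
                .load();

        // From here it behaves like any other DataFrame: filter, aggregate,
        // or feed it to Spark ML for training.
        events.groupBy("event_type").count().show();
    }
}
```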
For example, if I have a GetFile processor that I have designated to be isolated, how do the flow files coming from that processor get distributed across the cluster nodes?
Is there any additional work, or are there processors that need to be added?
In Apache NiFi today, the question of load balancing across the cluster has two main answers. First, you must consider how data gets to the cluster in the first place. Second, once it is in the cluster, do you need to rebalance it?
For getting data into the cluster it is important that you select protocols which are themselves scalable in nature. Protocols which offer queuing semantics are good for this, whereas protocols which do not offer queuing semantics are problematic. As examples with queuing semantics, think JMS queues, Kafka, or some HTTP APIs. Those are great because one or more clients can pull from them in a queue fashion and thus spread the load. Examples of protocols which do not offer such behavior would be GetFile or GetSFTP and so on. These are problematic because the client(s) have to share state about which data they have already pulled. To address even these protocols we've moved to a model of 'ListSFTP' and 'FetchSFTP', where ListSFTP occurs on one node in the cluster (the primary node) and then uses the Site-to-Site feature of NiFi to load balance to the rest of the cluster; each node then gets its share of work and does FetchSFTP to actually pull the data. The same pattern is offered for HDFS now as well.
In describing that pattern I also mentioned Site-to-Site. This is how two NiFi clusters can share data, which is great for inter-site and intra-site distribution needs. It also works well for spreading load within the same cluster: you simply send the data to the same cluster, and NiFi takes care of load balancing, failover, and detection of new and removed nodes.
So there are great options already. That said, we can do more, and in the future we plan to offer a way for you to indicate on a connection that it should be auto-load-balanced; it will then do what I've described behind the scenes.
Thanks
Joe
Here is an updated answer that works even more simply with newer versions of NiFi. I am running Apache NiFi 1.8.0 here.
The approach I found is to run a processor on the primary node that emits flow files, which are then consumed via a load-balanced connection.
For example, use one of the List* processors; under "Scheduling", set its "Execution" to run on the primary node.
This should feed into the next processor. Select the connection and set its "Load Balance Strategy".
You can read more about the feature in its design document.
I am using client-side partitioning on a 4-node Redis setup. The writes and reads are distributed among the nodes. Redis is used as a persistence layer for volatile data as well as a cache by different parts of the application. We also have a Cassandra deployment for persisting non-volatile data.
On Redis we peak at nearly 1k ops/sec (instantaneous_ops_per_sec). The load is expected to increase over time. There are many operations where we query for a non-existent key to check whether data is present for that key.
I want to achieve the following:
Writes should fail over to something else when a Redis node goes down.
There should be a backup for reading the data lost when a Redis node goes down.
If we add more Redis nodes in the future (or a dead node comes back up), reads and writes should be redistributed consistently.
I am trying to figure out a suitable design to handle the above scenario. I have thought of the following options:
Create hot slaves for the existing nodes and swap them as and when a master goes down. This will not address the third point.
Write an application layer to persist data in both Redis and Cassandra, allowing a lazy-load path for reads when a Redis node goes down. This approach has the overhead of writing to two stores.
Which is a better approach? Is there a suitable alternative to the above approaches?
A load of 1k ops/s is far below the capabilities of Redis. You would need to increase it by two or more orders of magnitude before you come close to overloading it. If you aren't expecting to exceed 50-70,000 ops/second and are not exceeding your available single-node memory, I really wouldn't bother with sharding your data, as it is more effort than it is worth.
That said, I wouldn't do this sharding client-side. I'd look at something like Twemproxy/Nutcracker to do it for you. This provides a path to a Redis Cluster as well as the ability to scale out connections, and it provides transparent client-side support for failover scenarios.
To handle failover in the client, you would want to set up two instances per slot (in your description, a write node), with one slaved to the other. Then you would run a Sentinel constellation to manage the failover.
Then you would need to have your client code connect to Sentinel to get the current master connectivity for each slot. This also means client code which can reconnect to the newly promoted master when a failover occurs. If you have load balancers available, you can place your Redis nodes behind one or more (preferably two, with failover) and eliminate client reconnection requirements, but you would then need to implement a Sentinel script or monitor to update the load balancer configuration on failover.
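For the client side, here is a minimal sketch using the Jedis client's JedisSentinelPool; the sentinel addresses and the master name "mymaster" are placeholders that would have to match your Sentinel configuration:

```java
import java.util.HashSet;
import java.util.Set;

import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisSentinelPool;

public class SentinelClientSketch {
    public static void main(String[] args) {
        // host:port of each node in the Sentinel constellation.
        Set<String> sentinels = new HashSet<>();
        sentinels.add("sentinel1:26379");
        sentinels.add("sentinel2:26379");
        sentinels.add("sentinel3:26379");

        // "mymaster" must match the master name in sentinel.conf. The pool
        // asks the sentinels for the current master address and reconnects
        // to the newly promoted master after a failover.
        try (JedisSentinelPool pool = new JedisSentinelPool("mymaster", sentinels);
             Jedis jedis = pool.getResource()) {
            jedis.set("some:key", "value");
            System.out.println(jedis.get("some:key"));
        }
    }
}
```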
For the Sentinel constellation, a standard 3-node setup will work fine. If you do your load balancing with software on nodes you control, it would be best to have at least two Sentinel nodes on the load balancers to provide natural connectivity tests.
Given your description, I would test out running a single master with multiple read slaves, and instead of hashing in client code, distribute reads to the slaves and writes to the master. This will provide a much simpler setup and likely less complex code on the client side. Scaling read slaves is easier and simpler, and as you describe it, the vast majority of ops will be read requests, so it fits your described usage pattern precisely.
You would still need to use Sentinel to manage failover, but that complexity would exist either way, so this results in a net decrease in code and code complexity. For a single master, Sentinel is almost trivial to set up; the caveats are code to either manage a load balancer or virtual IP, or to handle Sentinel discovery in the client code.
You are opening the distributed database Pandora's box here.
My best suggestion is: don't do it. Don't implement your own Redis cluster unless you can afford losing data and/or taking some downtime.
If you can afford running on not-yet-production-ready software, my suggestion is to have a look at the official Redis Cluster implementation; if your requirements are relaxed enough that you could roll your own cluster implementation, chances are you can afford to use Redis Cluster directly, which has a community behind it.
Have you considered looking at software other than Redis? Cassandra, Riak, DynamoDB, and Hadoop are great examples of mature distributed databases that would do what you ask out of the box.