Hazelcast without Split-Brain Recovery (merge) - caching

I'm experimenting Hazelcast, but in scenario of split brain, it automatically try to merge the cluster with PutIfAbsentMergePolicy. Is there any way to ignore the merge policy and just keep working with a inconsistent cluster? I want my app to be 100% available all the time and not try to merge.

You can configure the merge policy from the existing ones or write your custom one. However, I don't think you can avoid merging the cluster back after the split brain.

Related

How to know when data has been inserted in clickhouse

I understood that clickhouse is eventually consistent. So once an insert call returns, it doesn't mean that the data will appear in a select query.
does that apply to stand-alone clickhouse (no distribution, no replication)?
I understand the concept of eventual consistency for data replication, but does it apply with distribution but no replication?
using a distributed+replicated clickhouse, what is a recommended way to know that some insert(s) can be safely looked up?
Basically I didn't find much information on this topic, so maybe I am not asking the best questions. Feel free to enlighten me.
No, but single-node setup shouldn't be considered reliable either.
By default yes, you'll insert to node the client is connected to (probably via some load balancer) and Distributed table will asynchronously forward each piece of data to node where it belongs. The insert_distributed_sync=1 setting will make the client wait synchronously.
On insert use ***MergeTree shard tables directly (not Distributed) with insert_quorum=2 setting (if there are 3 replicas) and retry infinitely with exactly same batch if there are some errors (can use different replicas on retry, since there's a deduplication based on batch hash). Then on reads use select_sequential_consistency=1 setting.

Is there someway to make distributed table still work for query when one of the shard servers down?

There is a common case that we will update the clickhouse's config which must restart clickhouse to take effect. And during the restarting, the query services depend on clickhouse's distributed table will return the exception due to disconnecting with the restarting server.
So,as the title says, what I want is the way to make distributed table still work for query when one of the shard server down. Thanks.
I see two ways:
Since this server failure is transient, you can refactor your server-side code by adding retry-policy to your request (for c# I would recommend use Polly)
Use the proxy (load-balancer) to CH (for example chproxy).
UPDATE
When one node is restarting in a cluster the distributed table created over replicated tables should be accessible (of course request shouldn't be sent to restarted node).
Availability of data is achieved by using replication, therefore, you need to create Replicated*-tables over materialized view and then create Distributed-tables over Replicated*-tables.
Please look at the articles CH Data Distribution, Distributed vs Shard vs Replicated..
and as a working example (it is not your case) to CH Circular cluster topology.

Using ElasticSearch as a permanent storage

Recently I am working on a project which is producing a huge amount of data every day, in this project, there are two functionalities, one is storing data into Hbase for future analysis, and second one is pushing data into ElasticSearch for monitoring.
As the data is huge, we should store data into two platforms(Hbase,Elasticsearch)!
I have no experience in both of them. I want no know is it possible to use elasticsearch instead of hbase as a persistence storage for future analytics?
I recommend you reading this old but still valid article : https://www.elastic.co/blog/found-elasticsearch-as-nosql
Keep in mind, Elasticsearch is only a search engine. But it depends if your data are critical or if you can accept to lose some of them like non critical logs.
If you don't want to use an additionnal database with huge large data, you probably can store them into files in something like HDFS.
You should also check Phoenix https://phoenix.apache.org/ which may provide the monitoring features that you are looking for

Kafka Streams with lookup data on HDFS

I'm writing an application with Kafka Streams (v0.10.0.1) and would like to enrich the records I'm processing with lookup data. This data (timestamped file) is written into a HDFS directory on daily basis (or 2-3 times a day).
How can I load this in the Kafka Streams application and join to the actual KStream?
What would be the best practice to reread the data from HDFS when a new file arrives there?
Or would it be better switching to Kafka Connect and write the RDBMS table content to a Kafka topic which can be consumed by all the Kafka Streams application instances?
Update:
As suggested Kafka Connect would be the way to go. Because the lookup data is updated in the RDBMS on a daily basis I was thinking about running Kafka Connect as a scheduled one-off job instead of keeping the connection always open. Yes, because of semantics and the overhead of keeping a connection always open and making sure that it won't be interrupted..etc. For me having a scheduled fetch in this case looks safer.
The lookup data is not big and records may be deleted / added / modified. I don't know either how I can always have a full dump into a Kafka topic and truncate the previous records. Enabling log compaction and sending null values for the keys that have been deleted would probably won't work as I don't know what has been deleted in the source system. Additionally AFAIK I don't have a control when the compaction happens.
The recommend approach is indeed to ingest the lookup data into Kafka, too -- for example via Kafka Connect -- as you suggested above yourself.
But in this case how can I schedule the Connect job to run on a daily basis rather than continuously fetch from the source table which is not necessary in my case?
Perhaps you can update your question you do not want to have a continuous Kafka Connect job running? Are you concerned about resource consumption (load on the DB), are you concerned about the semantics of the processing if it's not "daily udpates", or...?
Update:
As suggested Kafka Connect would be the way to go. Because the lookup data is updated in the RDBMS on a daily basis I was thinking about running Kafka Connect as a scheduled one-off job instead of keeping the connection always open. Yes, because of semantics and the overhead of keeping a connection always open and making sure that it won't be interrupted..etc. For me having a scheduled fetch in this case looks safer.
Kafka Connect is safe, and the JDBC connector has been built for exactly the purpose of feeding DB tables into Kafka in a robust, fault-tolerant, and performant way (there are many production deployments already). So I would suggest to not fallback to "batch update" pattern just because "it looks safer"; personally, I think triggering daily ingestions is operationally less convenient than just keeping it running for continuous (and real-time!) ingestion, and it also leads to several downsides for your actual use case (see next paragraph).
But of course, your mileage may vary -- so if you are set on updating just once a day, go for it. But you lose a) the ability to enrich your incoming records with the very latest DB data at the point in time when the enrichment happens, and, conversely, b) you might actually enrich the incoming records with stale/old data until the next daily update completed, which most probably will lead to incorrect data that you are sending downstream / making available to other applications for consumption. If, for example, a customer updates her shipping address (in the DB) but you only make this information available to your stream processing app (and potentially many other apps) once per day, then an order processing app will ship packages to the wrong address until the next daily ingest will complete.
The lookup data is not big and records may be deleted / added / modified. I don't know either how I can always have a full dump into a Kafka topic and truncate the previous records. Enabling log compaction and sending null values for the keys that have been deleted would probably won't work as I don't know what has been deleted in the source system.
The JDBC connector for Kafka Connect already handles this automatically for you: 1. it ensures that DB inserts/updates/deletes are properly reflected in a Kafka topic, and 2. Kafka's log compaction ensures that the target topic doesn't grow out of bounds. Perhaps you may want to read up on the JDBC connector in the docs to learn which functionality you just get for free: http://docs.confluent.io/current/connect/connect-jdbc/docs/ ?

Torquebox Infinispan Cache - Too many open files

I looked around and apparently Infinispan has a limit on the amount of keys you can store when persisting data to the FileStore. I get the "too many open files" exception.
I love the idea of torquebox and was anxious to slim down the stack and just use Infinispan instead of Redis. I have an app that needs to cache allot of data. The queries are computationally expensive and need to be re-computed daily (phone and other productivity metrics by agent in a call center).
I don't run a cluster though I understand the cache would persist if I had at least one app running. I would rather like to persist the cache. Has anybody run into this issue and have a work around?
Yes, Infinispan's FileCacheStore used to have an issue with opening too many files. The new SingleFileStore in 5.3.x solves that problem, but it looks like Torquebox still uses Infinispan 5.1.x (https://github.com/torquebox/torquebox/blob/master/pom.xml#L277).
I am also using infinispan cache in a live application.
Basically we are storing database queries and its result in cache for tables which are not up-datable and smaller in data size.
There are two approaches to design it:
Use queries as key and its data as value
It leads to too many entries in cache when so many different queries are placed into it.
Use xyz as key and Map as value (Map contains the queries as key and its data as value)
It leads to single entry in cache whenever data is needed from this cache (I call it query cache) retrieve Map first by using key xyz then find the query in Map itself.
We are using second approach.

Resources