Multi data center for ClickHouse

Does ClickHouse support a multi-master or multi data center setup?
Are there any other solutions for multi data center replication for ClickHouse?

ClickHouse is multi-master only.
ClickHouse is multi-DC / geo-DC out of the box. There are many users with cross-ocean DCs.
The only requirement is proper latency for the Replicated* engines.
All ZooKeeper (ZK) nodes should be in the same DC, or in DCs with latency under 50 ms. ClickHouse loading nodes (the ones that ingest data) should be as close as possible to ZK (ideally under 100 ms). Non-loading replicas can be farther away, around 150-250 ms.
A cross-ocean setup needs load balancing configured properly so that queries run on local-DC replicas, plus some parameter tuning (e.g. connect_timeout_with_failover_ms, which is 50 ms by default).
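As a minimal sketch of that tuning (the values are illustrative, not recommendations), these session-level settings make ClickHouse prefer replicas whose hostnames are closest to the local node and give cross-DC connection attempts more headroom than the 50 ms default; the same keys can also be placed in a user profile in users.xml so they apply to every query:
SET load_balancing = 'nearest_hostname';
SET connect_timeout_with_failover_ms = 300;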

Yes, ClickHouse can be set up as multi-DC.
Please read about the Distributed engine:
https://clickhouse.yandex/docs/en/table_engines/distributed/
Also look at the load_balancing settings:
https://clickhouse.yandex/docs/en/operations/settings/settings/#settings-load_balancing
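For illustration, a minimal Distributed-table sketch (the cluster, database, and table names here are hypothetical; the cluster would be defined in the remote_servers section of your config). Queries against events_all fan out over the shards and replicas of my_cluster, and the load_balancing setting decides which replica in each shard is used:
CREATE TABLE events_all AS events_local
ENGINE = Distributed(my_cluster, default, events_local, rand());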

Related

Apache NiFi - Can it scale at the processor level?

Newbie alert to Apache NiFi!
Curious to understand (and read relevant material about) the scalability aspects of an Apache NiFi pipeline in a clustered setup.
Imagine there is a 2-node cluster, Node 1 & Node 2.
A simple use case as an example:
Query a database table in batches of 100 (let's say there are 10 batches).
For each batch, call a REST API (InvokeHTTP).
If a pipeline is triggered on Node 1 in a cluster, does this mean all 10 batches run only on Node 1?
Is there any out-of-the-box work distribution available in NiFi at the individual processor level? Along the lines of: 5 of the batches' REST API calls executed per node.
Is the built-in queue of NiFi distributed in nature?
Or is the recommended way to scale at the processor level to publish the output of the previous processor to a messaging middleware (like Kafka) and have the subsequent NiFi processor consume from it?
What's the recommended way to scale at every processor level in NiFi?
Every queue (connection) has a load balancing strategy setting with the following options:
Do not load balance: Do not load balance FlowFiles between nodes in the cluster. This is the default.
Partition by attribute: Determines which node to send a given FlowFile to based on the value of a user-specified FlowFile Attribute.
Round robin: FlowFiles will be distributed to nodes in the cluster in a round-robin fashion.
Single node: All FlowFiles will be sent to a single node in the cluster.
Details in documentation:
https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#Load_Balancing
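These strategies are normally set in the connection's configuration dialog on the canvas, but as a rough sketch they can also be set through NiFi's REST API (the host, connection id, and revision version below are placeholders, and the exact payload can vary by NiFi version):
curl -X PUT -H "Content-Type: application/json" \
  http://nifi-host:8080/nifi-api/connections/CONNECTION_ID \
  -d '{
    "revision": { "version": 1 },
    "component": {
      "id": "CONNECTION_ID",
      "loadBalanceStrategy": "ROUND_ROBIN"
    }
  }'
For the example in the question, a round-robin connection between the database-query processor and the InvokeHTTP processor is what would spread the 10 batches across both nodes.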

Scalable Elasticsearch module with Spring Data Elasticsearch possible?

I am working on designing a scalable service (Spring Boot) with which data will be indexed into Elasticsearch.
Use case:
My application uses 6 databases (MySQL) with the same schema. Each database caters to a specific region. I have a microservice that connects to all these DBs and indexes data from specific tables into an Elasticsearch server (v6.8.8) in a similar fashion, with 6 Elasticsearch indexes, one for each DB.
Quartz jobs are employed for this purpose, along with RestHighLevelClient. There are also delta jobs running every second that look for changes using audit tables and the indexes.
Current problem:
The current design is not scalable - one service does all the work (data loading, mapping, bulk upserts). Because indexing is done through Quartz jobs, scaling the service (running multiple instances) would run the same job multiple times.
No failover - I am looking at distributed Elasticsearch nodes and indexing data to both nodes. How can this be done efficiently?
I am considering Spring Data Elasticsearch to index data at the same time it is persisted to the DB.
Does it offer all the features? I use:
Elasticsearch administration, from installing templates to creating/deleting indexes and aliases.
Blue/green deployment - index to the non-active nodes and switch the aliases.
Bulk upserts, querying, aggregations, etc.
Any other solutions are welcome. Thanks for your time.
One of your use cases is to move data from the DB (MySQL) to ES in a scalable manner. That is basically a CDC (change data capture) pipeline.
You can use the Kafka Connect framework for this.
The flow would be:
Read the MySQL transaction log => publish the changes to Kafka (this can be accomplished using the Debezium source connector).
Consume the data from Kafka => push it to Elasticsearch (this can be accomplished using the Elasticsearch sink connector).
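As an illustrative sketch of the first step (host names, credentials, and table names are placeholders; the property names follow Debezium 1.x), a MySQL source connector is registered through the Connect REST API:
curl -X POST -H "Content-Type: application/json" http://connect:8083/connectors -d '{
  "name": "mysql-region1-source",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql-region1",
    "database.port": "3306",
    "database.user": "cdc_user",
    "database.password": "cdc_password",
    "database.server.id": "1",
    "database.server.name": "region1",
    "table.include.list": "mydb.orders",
    "database.history.kafka.bootstrap.servers": "kafka:9092",
    "database.history.kafka.topic": "schema-changes.region1"
  }
}'
Each of your 6 regional databases would get its own source connector; change events land in per-table topics named after the logical server name (region1.mydb.orders, and so on).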
Why use the framework?
With the Connect framework, data can be read directly from the MySQL transaction log without writing code.
The Connect framework is a distributed and scalable system.
It reduces the load on your database, since you no longer need to query the database to detect changes.
It is easy to set up.
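A matching sink for the second step might look like the following (again a hedged sketch with placeholder names, based on the Confluent Elasticsearch sink connector; type.name is still required for ES 6.x). The unwrap transform is Debezium's ExtractNewRecordState, which flattens Debezium's change-event envelope into plain rows before indexing:
curl -X POST -H "Content-Type: application/json" http://connect:8083/connectors -d '{
  "name": "es-region1-sink",
  "config": {
    "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
    "connection.url": "http://elasticsearch:9200",
    "topics": "region1.mydb.orders",
    "type.name": "_doc",
    "key.ignore": "false",
    "behavior.on.null.values": "delete",
    "transforms": "unwrap",
    "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
    "transforms.unwrap.drop.tombstones": "false"
  }
}'
With key.ignore=false the connector upserts documents by record key, and behavior.on.null.values=delete turns Debezium's tombstones into deletes, which covers the insert/update/delete cases from the question.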

Can I decrease usage of Non-Table Cluster Data on my CockroachDB cluster?

On my CockroachDB cluster, the time series data has grown to about 1 GB.
Is there any way to decrease it? Thank you!
Yes, you can control this. By default, CockroachDB stores timeseries data for the last 30 days for display in the Admin UI, but you can reduce the interval for timeseries storage or disable timeseries storage entirely.
Reduce the interval for timeseries storage
To reduce the interval for storage of timeseries data, change the timeseries.storage.resolution_10s.ttl cluster setting to an INTERVAL value less than 720h0m0s (30 days). For example, to store timeseries data for the last 15 days, run the following SET CLUSTER SETTING command:
SET CLUSTER SETTING timeseries.storage.resolution_10s.ttl = '360h0m0s';
Disable timeseries storage entirely
Note: Disabling timeseries storage entirely is recommended only if you exclusively use a third-party tool such as Prometheus for timeseries monitoring. Prometheus and other such tools do not rely on CockroachDB-stored timeseries data; instead, they ingest metrics exported by CockroachDB from memory and then store the data themselves.
To disable the storage of timeseries data entirely, run the following command:
SET CLUSTER SETTING timeseries.storage.enabled = false;
If you want all existing timeseries data to be deleted, change the timeseries.storage.resolution_10s.ttl cluster setting as well:
SET CLUSTER SETTING timeseries.storage.resolution_10s.ttl = '0s';
Further reference: https://www.cockroachlabs.com/docs/stable/operational-faqs.html#can-i-reduce-or-disable-the-storage-of-timeseries-data

Clustered NiFi, only one node is working

I'm using NiFi in clustered mode with two nodes, and I have noticed that only one node does all the work.
Any idea why that is? And how can I make nifi2 do some of the processing of the dataflow?
It depends on how data comes into your cluster. It is up to you as the dataflow designer to create an approach that allows the data to be partitioned across your cluster for processing. For example, a common pattern is to run the source processor on the Primary Node only and then use a load-balanced (e.g. Round robin) connection, as described in the answer above, so downstream processing is spread across both nodes.
See this post for an overview of strategies to do this:
https://community.hortonworks.com/articles/16120/how-do-i-distribute-data-across-a-nifi-cluster.html

RethinkDB Cross-Cluster Replication

I have 3 different pools of clients in 3 different geographical locations.
I need to configure RethinkDB with 3 different clusters and replicate data between them (inserts, updates, and deletes). I do not want to use sharding, only replication.
I couldn't find in the documentation whether this is possible, or how to configure multi-cluster replication.
Any help is appreciated.
I think a multi-cluster setup is just the same as a single cluster with nodes in different data centers.
First, you need to set up a cluster; follow this document: http://www.rethinkdb.com/docs/start-a-server/#a-rethinkdb-cluster-using-multiple-machines
Basically, use the command below to join a node into the cluster:
rethinkdb --join IP_OF_FIRST_MACHINE:29015 --bind all
Once you have your cluster set up, the rest is easy. Go to the admin UI, select the table, and under "Sharding and replication" click Reconfigure and enter how many replicas you want; just keep shards at 1.
You can also read more about sharding and replication at http://rethinkdb.com/docs/sharding-and-replication/#sharding-and-replication-via-the-web-console
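The same reconfiguration can also be scripted with ReQL instead of the web console; a minimal sketch (the table name is hypothetical), runnable from the Data Explorer:
r.table('clients').reconfigure({shards: 1, replicas: 3})
This keeps the table unsharded while maintaining a full copy on 3 replicas, so reads can be served near each pool of clients.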
