Force ZooKeeper to generate a snapshot

Is it possible to force ZooKeeper to take a snapshot at a specific time or at a specific interval, rather than only via 'snapCount' in the ZooKeeper configuration?
What I would like to do is schedule a snapshot every day, i.e. every 24 hours, either configured in ZooKeeper itself or forced via the command line.
The goal is to be able to 'roll back' ZooKeeper to a known state in time, in case the data gets corrupted or someone adds incorrect information.

The only method I have found to force a snapshot is to restart the ZooKeeper node, which makes it create a new snapshot with the latest data. That is quite a good solution if you use an ensemble. At the very least, you get consistent data if you take both the snapshot and the latest log files (a snapshot is fuzzy, so using it together with the transaction logs is strongly recommended).
Considering this, forcing a snapshot is not really necessary, I suppose (you should take the log files anyway). Copying log files on an active ZooKeeper node is also not a good idea...
So stop the ZooKeeper node and copy its dataDir (and dataLogDir if it is separate) to get a guaranteed healthy backup with consistent data for the moment in time when you stopped it.
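For the daily schedule asked about above, a minimal sketch of such a stop-copy-start backup job could look like this (zkServer.sh is assumed to be on the PATH; the paths are made up and must be adjusted to your installation), run once a day from cron:

import shutil
import subprocess
from datetime import datetime
from pathlib import Path

# Hypothetical paths -- adjust to your installation.
DATA_DIR = Path("/var/lib/zookeeper")      # dataDir from zoo.cfg
BACKUP_ROOT = Path("/backups/zookeeper")

def backup_zookeeper():
    """Stop the node, copy dataDir for a consistent backup, then restart it."""
    subprocess.run(["zkServer.sh", "stop"], check=True)
    try:
        target = BACKUP_ROOT / datetime.now().strftime("%Y-%m-%d")
        shutil.copytree(DATA_DIR, target)  # copy dataLogDir too if it is separate
    finally:
        subprocess.run(["zkServer.sh", "start"], check=True)

if __name__ == "__main__":
    backup_zookeeper()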

I don't believe ZooKeeper internally supports this functionality. However, you can always manually copy the latest snapshot and log files (one of each) somewhere for safekeeping. Rolling back to that point in time is then a matter of copying the files back into the data directory when needed.
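If stopping the node is not an option, a rough sketch of copying the newest snapshot and transaction log (one of each) might look like the following; it assumes the default <dataDir>/version-2 layout and hypothetical paths, and the caveat above about copying the log of an active node still applies:

import shutil
from pathlib import Path

# Hypothetical paths -- ZooKeeper keeps snapshot.<zxid> and log.<zxid>
# files under <dataDir>/version-2 by default.
VERSION_DIR = Path("/var/lib/zookeeper/version-2")
BACKUP_DIR = Path("/backups/zookeeper-files")

def copy_latest(prefix):
    """Copy the most recently modified file with the given prefix."""
    candidates = sorted(VERSION_DIR.glob(prefix + ".*"),
                        key=lambda p: p.stat().st_mtime)
    if candidates:
        shutil.copy2(candidates[-1], BACKUP_DIR / candidates[-1].name)

BACKUP_DIR.mkdir(parents=True, exist_ok=True)
copy_latest("snapshot")  # latest fuzzy snapshot
copy_latest("log")       # latest transaction log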

Related

How to set HDFS as the state backend for Flink

I want to store Flink state in HDFS so that I can recover it from HDFS after a crash. I am planning to write state to HDFS every 60 seconds. How can I achieve this?
Is this the configuration I need to follow?
https://ci.apache.org/projects/flink/flink-docs-release-1.4/ops/state/state_backends.html#setting-default-state-backend
And where do I specify the checkpoint interval? Any link or sample code would be helpful.
Choosing where checkpoints are stored (e.g., HDFS) is separate from deciding which state backend to use for managing your working state (which can be on-heap, or in local files managed by the RocksDB library).
These two concepts were cleanly separated in Flink 1.12. In earlier versions of Flink, the two appeared to be more strongly related than they actually are, because the filesystem and RocksDB state backend constructors took a file URI as a parameter specifying where the checkpoints should be stored.
The best way to manage all of this is to leave this out of your code, and to specify the configuration you want in flink-conf.yaml, e.g.,
state.backend: filesystem
state.checkpoints.dir: hdfs://namenode-host:port/flink-checkpoints
execution.checkpointing.interval: 10s
Information about checkpointing and savepointing can be found at https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/fault-tolerance/checkpointing/
On how to configure HDFS as a filesystem, you should check https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/filesystems/overview/
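If you also want sample code, checkpointing can be enabled programmatically; here is a minimal PyFlink sketch (the Java API offers the same setting via enableCheckpointing), with the state backend and the HDFS checkpoint directory still left to flink-conf.yaml as shown above:

from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# Checkpoint every 60 seconds; equivalent to setting
# execution.checkpointing.interval: 60s in flink-conf.yaml.
env.enable_checkpointing(60 * 1000)

# ... build and execute your job as usual ...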

Ingest log files from edge nodes to Hadoop

I am looking for a way to stream entire log files from edge nodes to Hadoop. To sum up the use case:
We have applications that produce log files ranging from a few MB to hundreds of MB per file.
We do not want to stream all the log events as they occur.
What we are looking for is pushing the log files in their entirety after they have been written completely (written completely = moved into another folder, for example... this is not a problem for us).
This should be handled by some kind of lightweight agent on the edge nodes, writing either to HDFS directly or, if necessary, to an intermediate "sink" that pushes the data to HDFS afterwards.
Centralized Pipeline Management (= configuring all edge nodes in a centralized manner) would be great
I came up with the following evaluation:
Elastic's Logstash and FileBeats
Centralized pipeline management for edge nodes is available, e.g. one centralized configuration for all edge nodes (requires a license)
Configuration is easy, WebHDFS output sink exists for Logstash (using FileBeats would require an intermediate solution with FileBeats + Logstash that outputs to WebHDFS)
Both tools are proven to be stable in production-level environments
Both tools are made for tailing logs and streaming these single events as they occur rather than ingesting a complete file
Apache NiFi w/ MiNiFi
The use case of collecting logs and sending entire files to another location, with a large number of edge nodes that all run the same "jobs", looks predestined for NiFi and MiNiFi
MiNiFi running on the edge node is lightweight (Logstash on the other hand is not so lightweight)
Logs can be streamed from MiNiFi agents to a NiFi cluster and then ingested into HDFS
Centralized pipeline management within the NiFi UI
writing to an HDFS sink is available out-of-the-box
Community looks active, development is led by Hortonworks (?)
We have had good experiences with NiFi in the past
Apache Flume
writing to an HDFS sink is available out-of-the-box
Looks like Flume is more of an event-based solution than a solution for streaming entire log files
No centralized pipeline management?
Apache Gobblin
writing to an HDFS sink is available out-of-the-box
No centralized pipeline management?
No lightweight edge node "agents"?
Fluentd
Maybe another tool to look at? Looking for your comments on this one...
I'd love to get some comments about which of the options to choose. The NiFi/MiNiFi option looks the most promising to me - and is free to use as well.
Have I forgotten any broadly used tool that is able to solve this use case?
I experience similar pain when choosing open source big data solutions, simply because there are so many roads to Rome. Though "asking for technology recommendations is off topic for Stack Overflow", I still want to share my opinions.
I assume you already have a Hadoop cluster to land the log files on. If you are using an enterprise-ready distribution, e.g. HDP, stay with its selection of data ingestion tools. This approach always saves you a lot of effort in installation, setting up central management and monitoring, and implementing security and system integration when there is a new release.
You didn't mention how you would like to use the log files once they land in HDFS. I assume you just want to make an exact copy, i.e. data cleansing or data transformation into a normalized format is NOT required during ingestion. Now I wonder why you didn't mention the simplest approach: a scheduled hdfs command to put the log files into HDFS from the edge node.
Now I can share one production setup I was involved in. In that setup, log files are pushed to, or pulled by, a commercial mediation system that does data cleansing, normalization, enrichment, etc. The data volume is above 100 billion log records every day. There are 6 edge nodes set up behind a load balancer. Logs first land on one of the edge nodes and are then put into HDFS with the hdfs command. Flume was used initially but was replaced by this approach due to performance issues (it may very well be that the engineers lacked experience in optimizing Flume). Worth mentioning, though: the mediation system has a management UI for scheduling the ingestion scripts. In your case, I would start with a cron job for the PoC and then move to e.g. Airflow.
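As an illustration of what such a scheduled job could do, here is a minimal sketch that ships completed log files with the hdfs CLI (the directory names are made up; it assumes finished files are moved into a "done" folder, as described in the question):

import subprocess
from pathlib import Path

# Hypothetical directories: the application moves finished files into DONE_DIR,
# and UPLOADED_DIR tracks what has already been shipped to HDFS.
DONE_DIR = Path("/var/log/app/done")
UPLOADED_DIR = Path("/var/log/app/uploaded")
HDFS_TARGET = "/data/raw/app-logs/"

def ship_completed_logs():
    """Push each completed log file to HDFS, then move it aside locally."""
    UPLOADED_DIR.mkdir(parents=True, exist_ok=True)
    for log_file in sorted(DONE_DIR.glob("*.log")):
        subprocess.run(["hdfs", "dfs", "-put", "-f", str(log_file), HDFS_TARGET],
                       check=True)
        log_file.rename(UPLOADED_DIR / log_file.name)

if __name__ == "__main__":
    ship_completed_logs()  # run from cron for the PoC, or as an Airflow task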
Hope it helps! I would be glad to hear about your final choice and your implementation.

dncp_block_verification log file increases in size in HDFS

We are using Cloudera CDH 5.3. I am facing a problem wherein the size of "/dfs/dn/current/Bp-12345-IpAddress-123456789/dncp_block_verification.log.curr" and "dncp_block_verification.log.prev" keeps increasing to TBs within hours. I have read in some blogs that this is an HDFS bug. A temporary solution is to stop the DataNode services and delete these files. But we have observed that the log file then grows again on one or another of the DataNodes (even on the same node after deleting it), so it requires continuous monitoring.
Does anyone have a permanent solution to this problem?
One solution, although slightly drastic, is to disable the block scanner entirely by setting the key dfs.datanode.scan.period.hours to 0 in the HDFS DataNode configuration (the default is 504 hours). The negative effect of this is that your DataNodes may not auto-detect corrupted block files (and would need to wait for a future block-reading client to detect them instead); this isn't a big deal if your average replication factor is around 3, but you can consider the change a short-term one until you upgrade to a release that fixes the issue.
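For reference, on a plain Hadoop setup this is the hdfs-site.xml property on the DataNodes (in Cloudera Manager the same key is set through the DataNode configuration); shown here only as an illustration of the suggestion above:

<property>
  <!-- 0 disables the DataNode block scanner; the default is 504 hours. -->
  <name>dfs.datanode.scan.period.hours</name>
  <value>0</value>
</property>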
Note that this problem will not happen if you upgrade to the latest CDH 5.4.x or higher release versions, which include the HDFS-7430 rewrite changes and associated bug fixes. These changes have done away with the use of such a local file, thereby removing the problem.

Strategy to persist the node's data for dynamic Elasticsearch clusters

I'm sorry that this is probably a rather broad question, but I haven't found a solution for this problem yet.
I try to run an Elasticsearch cluster on Mesos through Marathon with Docker containers. Therefore, I built a Docker image that can start on Marathon and dynamically scale via either the frontend or the API.
This works great for test setups, but the question remains how to persist the data so that if the cluster is scaled down (I know this also concerns the index configuration itself) or stopped, I can restart later (or scale up) with the same data.
The thing is that Marathon decides where (on which Mesos slave) the nodes run, so from my point of view it's not predictable whether all the data will be available to the "new" nodes upon restart if I persist the data to the Docker hosts via Docker volumes.
The only things that come to my mind are:
Using a distributed file system like HDFS or NFS, with volumes mounted either on the Docker host or in the Docker images themselves. Still, that leaves the question of how to load all the data during the new cluster's startup if the "old" cluster had, for example, 8 nodes and the new one only has 4.
Using the Snapshot API of Elasticsearch to save to a common drive somewhere in the network. I assume that this will have performance penalties...
Is there any other way to approach this? Are there any recommendations? Unfortunately, I didn't find a good resource on this topic. Thanks a lot in advance.
Elasticsearch and NFS are not the best of pals ;-). You don't want to run your cluster on NFS; it's much too slow, and Elasticsearch works better when the storage is faster. If you introduce the network into this equation, you'll get into trouble. I have no idea about Docker or Mesos, but I definitely recommend against NFS. Use snapshot/restore.
The first snapshot will take some time, but the rest of the snapshots should take less space and less time. Also, note that "incremental" means incremental at file level, not document level.
The snapshot itself needs all the nodes that hold the primaries of the indices you want snapshotted, and those nodes all need access to the common location (the repository) so that they can write to it. This shared access to the same location is usually not that obvious, which is why I'm mentioning it.
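As a sketch of what the snapshot side looks like against the snapshot API (the repository name, path, and host are made up; a shared-filesystem repository is assumed, and depending on the version the location may need to be whitelisted via path.repo):

import requests

ES = "http://localhost:9200"  # hypothetical client/coordinating node

# Register a shared-filesystem repository; the location must be visible to
# every data node that holds primaries of the indices being snapshotted.
requests.put(
    ES + "/_snapshot/my_backup",
    json={"type": "fs", "settings": {"location": "/mnt/es-backups/my_backup"}},
).raise_for_status()

# Take a snapshot of all indices and wait until it finishes.
requests.put(
    ES + "/_snapshot/my_backup/snapshot_1",
    params={"wait_for_completion": "true"},
).raise_for_status()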
The best way to run Elasticsearch on Mesos is to use a specialized Mesos framework. The first effort in this area is https://github.com/mesosphere/elasticsearch-mesos. There is a more recent project which is, AFAIK, currently under development: https://github.com/mesos/elasticsearch. I don't know what its status is, but you may want to give it a try.

Setting up a single backup node for an Elasticsearch cluster?

Given an Elasticsearch cluster with several machines, I would like to have a single machine (a special node) located in a different geographical region that can effectively sync with the cluster for read-only purposes (i.e. no writes go to the special node, and the special node should be able to handle all queries on its own). Is this possible, and how can it be done?
With Elasticsearch 1.0 (currently available as RC1) you can use the snapshot & restore API; have a look at this blog post too to learn more.
You can basically make a snapshot of your indices, then copy the snapshot over to the secondary location and restore it into a different cluster. The nice part is that snapshots are incremental, which means that only the files that have changed since the last snapshot are actually backed up. You can then create snapshots at regular intervals, and import them into the secondary cluster.
If you are not using 1.0 yet, I would suggest having a look at it; snapshot & restore is a great addition. You can still make backups manually and restore them with 0.90, but you don't have a nice API to do that and you need to do pretty much everything manually.
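For the read-only copy described in the question, a rough sketch of the restore side on the secondary cluster (same made-up repository name and path as in the snapshot sketch earlier; the repository contents need to be reachable from that cluster, e.g. copied over or on shared storage):

import requests

SECONDARY = "http://secondary-cluster:9200"  # hypothetical node in the other region

# Register the same repository on the secondary cluster, read-only there.
requests.put(
    SECONDARY + "/_snapshot/my_backup",
    json={"type": "fs",
          "settings": {"location": "/mnt/es-backups/my_backup",
                       "readonly": True}},
).raise_for_status()

# Restore the snapshot taken on the primary cluster; the indices being
# restored must be closed or absent on this cluster.
requests.post(
    SECONDARY + "/_snapshot/my_backup/snapshot_1/_restore",
    params={"wait_for_completion": "true"},
).raise_for_status()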
