The documentation is almost exactly the same for both, so I find it really hard to see the difference between them. Based on this Q&A, the sole difference is
[...] that checkpoint node can upload the new copy of fsimage file back to namenode after checkpoint creation where as a secondary namenode can’t upload [...]
It doesn't seem correct, because the Secondary NameNode can upload the new FsImage, based on this and this.
Can the Checkpoint Node be considered as a one-to-one replacement of the Secondary NameNode? What are the added benefits? Was the code cleaned up between the two or something like that?
In the Cloudera post you mentioned, the checkpointing process is clearly described for both the high-availability and non-high-availability scenarios. The Secondary NameNode performs the checkpointing of the NameNode in the non-HA scenario; in the HA scenario, the Standby NameNode does it. In summary, checkpointing is more of a concept, and depending on the scenario (HA/non-HA), different nodes perform that operation.
You can read that blog again and let me know if any corrections are needed. Happy learning!
The difference between the Secondary NameNode (SNN) and the Checkpoint Node is that the SNN stores the merged data locally in its file system but does not upload the merged fsimage (with the edit logs applied) back to the active NameNode, whereas the Checkpoint Node does.
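For reference, both daemons are driven by the same checkpoint triggers in hdfs-site.xml; a minimal sketch with the usual default values (adjust to taste):

<!-- hdfs-site.xml: how often a checkpoint is taken, regardless of which daemon performs it -->
<property>
  <name>dfs.namenode.checkpoint.period</name>
  <value>3600</value> <!-- seconds between two consecutive checkpoints -->
</property>
<property>
  <name>dfs.namenode.checkpoint.txns</name>
  <value>1000000</value> <!-- force a checkpoint after this many uncheckpointed transactions -->
</property>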
Even the Apache documentation page does not cover the differences properly; from it, the roles of the Secondary NameNode and the Checkpoint Node appear to be similar.
On a different note, I have raised a bug to correct the documentation page and avoid the confusion: https://issues.apache.org/jira/browse/HDFS-8913
I hope this bug will be resolved soon.
Regarding your second query:
Can the Checkpoint Node be considered as a one-to-one replacement of the Secondary NameNode? What are the added benefits? Was the code cleaned up between the two or something like that?
A lot of things have changed with the Hadoop 2.x release, and the NameNode is no longer a single point of failure.
High availability of the active NameNode, with the help of a standby NameNode, is a key feature of Hadoop 2.x.
You just need an active NameNode and a standby NameNode to achieve high availability.
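As a rough illustration, an HA pair is described to HDFS in hdfs-site.xml along these lines (the nameservice and NameNode IDs and the hostnames are placeholders; the shared-edits/JournalNode and fencing settings are omitted):

<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>namenode1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>namenode2.example.com:8020</value>
</property>
<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>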
Hadoop 2.x High Availability has been explained clearly in other SE questions:
Hadoop namenode : Single point of failure
How does Hadoop Namenode failover process works?
We have a small Elasticsearch cluster of 3 nodes: two in one datacenter and one in another, for disaster-recovery reasons. However, if the first two nodes fail simultaneously, the third one won't work either; it just throws "master not discovered or elected yet".
I understand that this is intended; this is how an Elasticsearch cluster should work. But is there some additional configuration that I don't know about to keep the third node working on its own, even if only in read-only mode?
Nope, there's not. As you mentioned, it's designed that way.
You're probably not doing yourselves a lot of favours by running things across datacentres like that; network issues are not kind to Elasticsearch due to its distributed nature.
Elasticsearch runs in distributed mode by default. Nodes assume that they are, or will be, part of a cluster, and during setup they try to join the cluster automatically.
If you want Elasticsearch to be available on a single node, without the need to communicate with other Elasticsearch nodes, it can work much like a standalone server. To do this, tell Elasticsearch to work in local-only mode (networking disabled).
Open your elasticsearch/config/elasticsearch.yml and set:
node.local: true
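Note that node.local only exists in older Elasticsearch releases and has since been removed. If your version no longer accepts it, the closest equivalent I am aware of is single-node discovery, e.g.:

# elasticsearch.yml: run as a standalone node, skipping cluster formation
discovery.type: single-node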
I have a Flink app with high parallelism (400) running in AWS EMR. It reads from Kafka and sinks to S3 using BucketingSink (with the RocksDB backend for checkpointing). The destination is defined using the "s3a://" prefix. The Flink job is a streaming app which runs continuously. At any given time, it's possible that all workers combined will generate/write to 400 files (due to the 400 parallelism). After a few days, one of the workers will fail with the exception:
org.apache.hadoop.fs.s3a.AWSS3IOException: copyFile(bucket/2018-09-01/05/_file-10-1.gz.in-progress, bucket/2018-09-01/05/_file-10-1.gz.pending): com.amazonaws.services.s3.model.AmazonS3Exception: We encountered an internal error. Pelase try again. (Service: Amazon S3; Status Code: 200 InternalError; Request ID: xxxxxxxxxx; S3 Extended Request ID: yyyyyyyyyyyyyyy
at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:178)
at org.apache.hadoop.fs.s3a.S3AFileSystem.copyFile(S3AFileSystem.java:1803)
at org.apache.hadoop.fs.s3a.S3AFileSystem.innerRename(S3AFileSystem.java:776)
at org.apache.hadoop.fs.s3a.S3AFileSystem.rename(S3AFileSystem.java:662)
at org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink.closeCurrentPartFile(BucketingSink.java:575)
at org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink.openNewPartFile(BucketingSink.java:514)
at org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink.invoke(BucketingSink.java:446)
This seems to occur randomly when the BucketingSink creates a new part file. The odd thing is that when it happens, it affects only one of the parallel Flink workers (not all of them). Also, when this occurs, the Flink job transitions into a FAILING state, but it does not restart and resume/recover from the last successful checkpoint. What is the cause of this and how should it be resolved? Additionally, how can the job be configured to restart/recover from the last successful checkpoint instead of remaining in the FAILING state?
I think this is known behavior with the bucketing sink and S3, and the suggested solution is to use the shiny new StreamingFileSink in Flink 1.7.0.
Basically, the bucketing sink expects writes and renames to happen immediately, like they would in a real file system, but that isn't a good assumption for object stores like S3, so the bucketing sink ends up with race conditions that cause intermittent problems. Here's a JIRA ticket that sort of describes the problem, and the related tickets flesh it out a bit more: FLINK-9752.
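On the recovery side of the question: a common reason a streaming job stays in FAILING instead of restarting is the restart strategy (for example restart-strategy: none in flink-conf.yaml, or exhausted restart attempts). A minimal sketch that sets the strategy explicitly and uses the newer sink; the socket source and the bucket path are placeholders for your Kafka source and S3 destination:

import java.util.concurrent.TimeUnit;

import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;

public class S3SinkSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint regularly so a restart can resume from the last successful checkpoint.
        env.enableCheckpointing(60_000);

        // Restart a few times with a delay instead of staying in FAILING/FAILED.
        env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, Time.of(30, TimeUnit.SECONDS)));

        // Stand-in source; in the real job this would be the Kafka consumer.
        DataStream<String> records = env.socketTextStream("localhost", 9999);

        // StreamingFileSink (Flink 1.7+) replaces BucketingSink and does not rely on
        // immediate rename semantics, which is what trips up BucketingSink on S3.
        StreamingFileSink<String> sink = StreamingFileSink
                .forRowFormat(new Path("s3a://my-bucket/output"), new SimpleStringEncoder<String>("UTF-8"))
                .build();

        records.addSink(sink);
        env.execute("s3-sink-sketch");
    }
}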
We have a two-node Aerospike cluster and thought of adding two more nodes. As soon as I added them, we started getting "queue too deep" errors on the new nodes, as well as "device overload" on the client.
I tried lowering migrate-max-num-incoming from 256 to 4, but the issue persists.
What is the best way to add a new node to the cluster without impacting the clients?
More info:
1) We are using an SSD-based installation.
2) We are using the mesh network architecture.
Your storage is not keeping up.
The following links should help:
1- Understand device overload:
https://discuss.aerospike.com/t/device-overload/733
2- Understand how to tune migrations:
http://www.aerospike.com/docs/operations/manage/migration#lowering-the-migration-rate
3- This could also be caused by defragmentation on the pre-existing nodes in the cluster, as data migrating out creates a vacuum effect and can cause defragmentation activity to pick up; in that case you would want to slow defragmentation down by increasing defrag-sleep:
http://www.aerospike.com/docs/reference/configuration#defrag-sleep
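If it is migrations or defragmentation overwhelming the SSDs, those knobs can be changed dynamically. The values below are only illustrative and <ns> is a placeholder for your namespace; verify the syntax against the docs linked above:

# throttle incoming migrations on the new nodes
asinfo -v "set-config:context=service;migrate-max-num-incoming=4"
# slow defragmentation on the existing nodes by increasing defrag-sleep (microseconds)
asinfo -v "set-config:context=namespace;id=<ns>;defrag-sleep=5000"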
Add one node at a time, and wait until migrations are complete before adding the next one. (I assume all nodes are running the same version of Aerospike and the configuration is consistent, i.e. all have the same namespaces defined, etc.)
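To check whether migrations have finished before adding the next node, something along these lines can be used (the exact statistic names vary by server version, so treat this as a sketch):

# cluster overview, including migration columns
asadm -e "info"
# raw migration-related statistics on a node
asinfo -v "statistics" | grep -i migrate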
I use Ganglia to monitor Hadoop. I chose the metric "dfs.datanode.HeartbeatsAvgTime" to judge whether the DataNode (I mean the DataNode service, not the host) is down or not.
When the DataNode is working fine, "dfs.datanode.HeartbeatsAvgTime" keeps changing; that is to say, the value in the graph varies.
But after I stop the DataNode service, the value in the graph stops changing. It is not 0 or infinity, just frozen at the last reported value, so I cannot tell from it whether the DataNode service is up or down.
It is the same with other metrics.
I've checked the RRD files that Ganglia uses to store the metric data with "rrdtool fetch". The metric's values are stored in a *.rrd file. When I check the file, I find that after I stop the DataNode, the metric's value is still being updated, but it does not vary.
I read the RRD references on its official website. They say that if RRD does not receive an update within the configured interval, it writes UNKNOWN into the *.rrd file.
I think there may be two possible causes of the problem:
1) gmetad does not receive the metric and updates the RRD with the old value, so the graph stays at the old value.
2) gmond cannot collect the metric and reports the old value to gmetad.
But I haven't really found any evidence for either in Ganglia's source code on GitHub.
So, do you know how to solve the problem of the graph value remaining unchanged? Or do you know other details about how to monitor a Hadoop cluster with Ganglia?
@DaveStephens @Lorin Hochstein
After struggling with the problem, I found that if we set dmax for the metric in hadoop-metrics2.properties, then when Hadoop goes down Ganglia no longer receives any data and returns UNKNOWN. The graph on the Ganglia website disappears, and with Ganglia + Nagios, Nagios also returns an UNKNOWN status. That is enough to judge whether Hadoop is up or down.
dmax means that if no update arrives for dmax seconds, the metric is expired (destroyed) on the Ganglia side.
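A sketch of what this looks like; the sink name, Ganglia address, period and dmax value are illustrative, and the metric name must match what your sink actually reports:

# hadoop-metrics2.properties on the DataNode host
datanode.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31
datanode.sink.ganglia.servers=239.2.11.71:8649
datanode.sink.ganglia.period=10
# expire the metric if no update arrives for 120 seconds, so the graph goes
# UNKNOWN instead of freezing at the last reported value
datanode.sink.ganglia.dmax=dfs.datanode.HeartbeatsAvgTime=120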
I'm trying to figure out how to restore the state of a Storm bolt instance during failover. I can persist the state externally (DB or file system); however, once the bolt instance is restarted, I need to point it at the specific state of that bolt instance to recover it.
The prepare method of a bolt receives a context, documented here http://nathanmarz.github.io/storm/doc/backtype/storm/task/TopologyContext.html
What is not clear to me is: is there any piece of this context that uniquely identifies the specific bolt instance, so I can tell which persisted state to point to? Is that ID preserved during failover? Alternatively, is there any variable/object I can set for the specific bolt instance that is preserved during failover? Any help appreciated!
Br,
Sib
P.S. New to Stack Overflow, so please bear with me...
You can probably look at Trident. It's basically an abstraction built on top of Storm. The documentation says:
Trident has first-class abstractions for reading from and writing to stateful sources. The state can either be internal to the topology – e.g., kept in-memory and backed by HDFS – or externally stored in a database like Memcached or Cassandra
In case of any failover, it says:
Trident manages state in a fault-tolerant way so that state updates are idempotent in the face of retries and failures.
You can go through the documentation for any further clarification.
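For illustration, here is a minimal sketch along the lines of the Trident word-count tutorial, where persistentAggregate keeps the counts as Trident-managed state. The package names follow the pre-Apache backtype.storm/storm.trident layout used by the docs linked in the question, and MemoryMapState.Factory is just a stand-in for a real state factory backed by Memcached or Cassandra:

import backtype.storm.generated.StormTopology;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import storm.trident.TridentState;
import storm.trident.TridentTopology;
import storm.trident.operation.BaseFunction;
import storm.trident.operation.TridentCollector;
import storm.trident.operation.builtin.Count;
import storm.trident.testing.FixedBatchSpout;
import storm.trident.testing.MemoryMapState;
import storm.trident.tuple.TridentTuple;

public class TridentStateSketch {

    // Splits a sentence tuple into one word tuple per word.
    public static class Split extends BaseFunction {
        @Override
        public void execute(TridentTuple tuple, TridentCollector collector) {
            for (String word : tuple.getString(0).split(" ")) {
                collector.emit(new Values(word));
            }
        }
    }

    public static StormTopology build() {
        // Small in-memory spout that replays fixed batches of sentences.
        FixedBatchSpout spout = new FixedBatchSpout(new Fields("sentence"), 3,
                new Values("the cow jumped over the moon"),
                new Values("the man went to the store"));
        spout.setCycle(true);

        TridentTopology topology = new TridentTopology();
        // persistentAggregate stores the per-word counts as managed state; Trident
        // tags each batch with a txid so replayed batches update the state idempotently.
        TridentState wordCounts = topology.newStream("spout1", spout)
                .each(new Fields("sentence"), new Split(), new Fields("word"))
                .groupBy(new Fields("word"))
                .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"));

        return topology.build();
    }
}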
Thanks (and credit) to the Storm user group!
http://mail-archives.apache.org/mod_mbox/storm-user/201312.mbox/%3C74083558E4509844944FF5CF2BA7B20F1060FD0E#ESESSMB305.ericsson.se%3E
In the original Storm, both spouts and bolts are stateless. Storm manages to restart failed nodes, but it takes some effort to restore their state. There are two solutions I can think of:
If a message fails to process, Storm replays it from the ROOT (spout) of the topology, and the replay logic has to be implemented by the user. So in this case I would put more state information (e.g. the ID of some external state storage and the ID of this task) into the messages.
Or you can use Trident. It provides a txid for each transaction and simplifies the storage process.
I'm OK with the first solution because my app doesn't require transactional operations, and I have a better understanding of the original Storm (it generates simpler logs than Trident does).
You can use the task ID.
Task IDs are assigned at topology creation and are static. If a task dies/restarts or gets reassigned somewhere else, it will still have the same ID.
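A minimal sketch of using it in prepare() as a key for externally persisted state (the storage lookup itself is omitted, and using component ID + task ID as the key is just one reasonable choice, not a Storm requirement):

import java.util.Map;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Tuple;

public class RestorableBolt extends BaseRichBolt {
    private OutputCollector collector;
    private String stateKey;

    @Override
    public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        // The task id is stable across restarts/reassignment, so it can key this
        // instance's slice of external state.
        this.stateKey = context.getThisComponentId() + "-" + context.getThisTaskId();
        // e.g. load previously persisted state for stateKey from the DB here (omitted).
    }

    @Override
    public void execute(Tuple input) {
        // ... update in-memory state, persist it under stateKey, then ack ...
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // This sketch emits nothing.
    }
}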