I'm new to Hadoop and am running under AWS Elastic MapReduce.
I need cluster-wide atomic counters in Hadoop, and it was suggested that I use ZooKeeper for this.
I believe ZooKeeper is part of the Hadoop stack (right?). How would I access it from an Elastic MapReduce job in order to set and update a cluster-wide counter?
I believe ZooKeeper is part of the Hadoop stack (right?)
ZooKeeper (ZK) is not part of the Hadoop stack. It's a Top-Level Project (TLP) under Apache and is independent of Hadoop. So, first ZK has to be installed on EC2. Here are the instructions for doing that.
how would I access it from an Elastic MapReduce job in order to set and update a cluster-wide counter?
Once installed, ZK can be used to generate a cluster-wide counter using the ZK API. Here (1 and 2) are discussions of the approach with its pros and cons. Here are some other alternatives to ZK for the same requirement.
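To make the API usage concrete, here is a minimal sketch of a cluster-wide counter built with the Apache Curator recipes library on top of ZK (Curator is not mentioned above; the connect string, counter path, and retry settings are assumptions for illustration):

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.atomic.AtomicValue;
import org.apache.curator.framework.recipes.atomic.DistributedAtomicLong;
import org.apache.curator.retry.ExponentialBackoffRetry;
import org.apache.curator.retry.RetryNTimes;

public class ClusterCounter {
    public static void main(String[] args) throws Exception {
        // Connect to the ZK ensemble installed on EC2 (hosts are assumptions)
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk-host-1:2181,zk-host-2:2181,zk-host-3:2181",
                new ExponentialBackoffRetry(1000, 3));
        client.start();

        // A counter stored under a znode; every node in the cluster uses the same path
        DistributedAtomicLong counter = new DistributedAtomicLong(
                client, "/counters/events", new RetryNTimes(10, 100));

        AtomicValue<Long> result = counter.increment();
        if (result.succeeded()) {
            System.out.println("Counter is now " + result.postValue());
        }

        client.close();
    }
}
```

The same code can be called from mapper or reducer tasks, as long as each task can reach the ZK ensemble over the network.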
You can, as Praveen Sripati answered.
But I want to clarify a few points:
Keep in mind that ZK has a limited write rate (~300 requests per second).
Clients can see stale data (ZK doesn't guarantee read consistency across replicas).
I suggest using a dedicated sequence-generator server, which will generate sequences for you (and that service can use ZK or whatever it wants internally). One example of such a service: https://github.com/kasabi/H1
Related
I understand, based on the slides, that in the context of Hadoop, ZooKeeper is used for storing information about the master, the status of different tasks, and which worker is working on which partition; the available workers are also stored in ZooKeeper.
Why is ZooKeeper used for this metadata storage? Any data store could be used, right?
For instance, Celery can be configured with any result backend (Redis, Mongo, etc.). So in practice Hadoop could use any storage backend, right? But why ZooKeeper?
This doc suggests that Redis, SQLite, MySQL, and PostgreSQL can be used for Celery task result storage.
https://docs.celeryq.dev/en/stable/getting-started/backends-and-brokers/index.html
ZooKeeper's ZAB protocol is used for leader election as well as distributed locks.
It is not simply a datastore, and no, not just any store can be used.
Celery isn't used within the Hadoop ecosystem, so I'm not sure how that's relevant to the question.
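To give a sense of what those coordination primitives look like in practice, here is a minimal sketch of a distributed lock and a leader latch using the Apache Curator recipes on top of ZooKeeper (Curator, the host, and the znode paths are my own assumptions, not something from the question):

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.framework.recipes.locks.InterProcessMutex;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class CoordinationDemo {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk-host:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        // Distributed lock: only one process in the cluster holds it at a time
        InterProcessMutex lock = new InterProcessMutex(client, "/locks/my-resource");
        lock.acquire();
        try {
            // ... critical section ...
        } finally {
            lock.release();
        }

        // Leader election: all candidates start a latch on the same path;
        // ZooKeeper's consistency guarantees ensure only one holds leadership at a time
        LeaderLatch latch = new LeaderLatch(client, "/election/my-service");
        latch.start();
        latch.await();   // blocks until this process is elected leader
        System.out.println("I am the leader now");

        latch.close();
        client.close();
    }
}
```

A plain key-value store doesn't give you these ordering and liveness guarantees out of the box, which is why ZooKeeper (rather than an arbitrary backend) is used for this kind of metadata and coordination.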
I'm going to do a Hadoop POC in a production environment. The POC consists of:
1. Receive lots of (real life) events
2. Accumulate them until I have a set of events of sufficient size
3. Persist the set of events in a single file in HDFS
If the POC is successful, I want to install a cluster environment, but I need to keep the data already persisted in the single-node installation (POC).
So, the question: how difficult is it to migrate the data already persisted in the single-node HDFS to a real clustered HDFS environment?
Thanks in advance (and sorry for my bad English).
You don't need to migrate anything.
If you're running Hadoop in pseudo-distributed mode, all you need to do is add datanodes that point at your existing namenode, and that's it!
I would like to point out this part:
Persist the set of events in a single file in HDFS
I'm not sure about making "a single file", but I suggest you do periodic checkpointing. What if the stream fails? How do you catch dropped events? Spark, Flume, Kafka Connect, NiFi, etc. can help you do this.
And if all you're doing is streaming events and want to store them for a variable time period, then Kafka is better suited to that use case. You don't necessarily need Hadoop. Push events to Kafka, then consume them wherever it makes sense, for example into a search engine or a database (Hadoop is not a database).
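As a rough illustration of the "push events to Kafka" part, here is a minimal producer sketch using the standard Kafka Java client (the broker address, topic name, and String serialization are assumptions for the example):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class EventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Assumed broker address; replace with your cluster's bootstrap servers
        props.put("bootstrap.servers", "broker-1:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Each real-life event becomes one record on an assumed "events" topic
            producer.send(new ProducerRecord<>("events", "event-key", "event-payload"));
        }
    }
}
```

Downstream consumers (or Kafka Connect sinks) can then write batches to HDFS, a search engine, or a database on whatever checkpointing schedule you choose.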
I am trying to set up clustered Hadoop and Cassandra. Many sites I've read use a lot of words and concepts I am slowly grasping but I still need some help.
I have 3 nodes. I want to set up Hadoop and Cassandra on all 3. I am familiar with Hadoop and Cassandra individually, but how do they work together, and how do I configure them to work together? Also, how do I set up one node dedicated to, for example, analytics?
So far I have modified my hadoop-env.sh to point to the Cassandra libs. I have put this on all of my nodes. Is that correct? What more do I need to do, and how do I run it: do I start the Hadoop cluster or Cassandra first?
Last little question: do I connect directly to Cassandra or to Hadoop from within my Java client?
Rather than connecting them via your Java client, you need to install Cassandra on top of Hadoop. Please follow the article for step-by-step assistance.
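To give a feel for how the two work together, here is a rough sketch of a MapReduce job driver that reads directly from Cassandra using the Hadoop support (ColumnFamilyInputFormat and ConfigHelper) that ships with older Cassandra releases. The keyspace, column family, host, and partitioner below are placeholders, and the exact class names depend on your Cassandra version, so treat this as an assumption-laden outline rather than a definitive setup:

```java
import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CassandraMRDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "read-from-cassandra");
        job.setJarByClass(CassandraMRDriver.class);

        // The MR tasks read straight from Cassandra, not via your own client code.
        // Keyspace, column family, host, and partitioner are placeholders.
        job.setInputFormatClass(ColumnFamilyInputFormat.class);
        ConfigHelper.setInputInitialAddress(job.getConfiguration(), "cassandra-host");
        ConfigHelper.setInputRpcPort(job.getConfiguration(), "9160");
        ConfigHelper.setInputColumnFamily(job.getConfiguration(), "my_keyspace", "my_cf");
        ConfigHelper.setInputPartitioner(job.getConfiguration(),
                "org.apache.cassandra.dht.Murmur3Partitioner");

        // ... set your Mapper/Reducer classes and the output format here ...

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

In this setup your Java client talks to Cassandra for normal reads/writes, while analytics jobs are submitted to Hadoop and pull their input from Cassandra themselves.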
I'm a beginner programmer and a Hadoop learner.
I'm testing Hadoop in fully distributed mode using 5 PCs (each with a dual-core CPU and 2 GB of RAM).
Before starting the map tasks and HDFS, I knew that I had to configure some files (/etc/hosts with IPs and hostnames, and the conf/masters and conf/slaves files in the Hadoop folder), so I finished configuring those files.
During a discussion at a seminar in my company, my boss and chief insisted that even while a Hadoop application is running, if Hadoop needs more nodes, it will automatically add more nodes.
Is that possible? When I studied Hadoop clustering, many Hadoop books and community sites said that after configuring the cluster and starting an application, you can't add more nodes.
But my boss told me that Amazon says adding nodes while an application is running is possible.
Is that really true?
Hadoop experts of the Stack Overflow community, please tell me the truth in detail.
Yes, it is indeed possible.
Here is the explanation in Hadoop's wiki.
Also, Amazon's EMR lets you add hundreds of nodes on the fly to an already running cluster, and as soon as the machines are up they are delegated tasks (unstarted mapper and/or reducer tasks) by the master.
So yes, it is very much possible and is in use, not just in theory.
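As a concrete illustration of resizing a running EMR cluster, here is a rough sketch using the AWS SDK for Java (v1); the instance group ID and target count are placeholders, and you should check your SDK version for the exact builder and model class names:

```java
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
import com.amazonaws.services.elasticmapreduce.model.InstanceGroupModifyConfig;
import com.amazonaws.services.elasticmapreduce.model.ModifyInstanceGroupsRequest;

public class ResizeEmrCluster {
    public static void main(String[] args) {
        AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.defaultClient();

        // Grow an (assumed) instance group of a running cluster to 10 nodes;
        // the new machines pick up unstarted map/reduce tasks once they join.
        ModifyInstanceGroupsRequest request = new ModifyInstanceGroupsRequest()
                .withInstanceGroups(new InstanceGroupModifyConfig()
                        .withInstanceGroupId("ig-XXXXXXXXXXXX")   // placeholder ID
                        .withInstanceCount(10));

        emr.modifyInstanceGroups(request);
    }
}
```

The same resize can be done from the AWS console or CLI; the point is that the cluster keeps running while the group is grown.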
Can anyone tell me whether I can read data from HBase on Amazon using org.apache.hadoop.conf.Configuration and org.apache.hadoop.hbase.client.HTablePool?
We are migrating to Amazon's EMR framework with HBase running on top of it.
The present implementation is based on the pure Apache Hadoop and HBase distributions. I'm trying to verify that no code changes are needed when we migrate to Amazon's EMR.
Please share your thoughts.
While code changes should not be needed, I would expect problems and adjustments related to the nature of EC2 and its networking.
HBase relies on Region Servers being able to renew their leases in a timely manner. If Region Servers are too busy, because of some massive operations running on them, they cannot do so and get kicked out of the cluster.
On Amazon, the performance of EC2 instances is much less predictable than in a dedicated cluster (unless you use cluster instances), so adjusting timeout parameters and/or the nature of your loads might be needed to get the cluster working properly.
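If it helps, here is a minimal sketch of the kind of client setup and timeout tuning being suggested, using the classes named in the question (the ZooKeeper quorum host, table name, and timeout values are assumptions, and HTablePool is from the older HBase client API):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.HTablePool;

public class HBaseOnEmrClient {
    public static void main(String[] args) throws Exception {
        // Same client code as against a plain Apache HBase cluster;
        // only the connection and timeout settings point at the EMR environment.
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "emr-master-node");   // assumed host
        conf.setInt("zookeeper.session.timeout", 120000);        // more forgiving on EC2
        conf.setLong("hbase.rpc.timeout", 120000L);

        HTablePool pool = new HTablePool(conf, 10);
        HTableInterface table = pool.getTable("events");          // assumed table name
        try {
            // ... Get/Scan/Put operations exactly as before ...
        } finally {
            table.close();
            pool.close();
        }
    }
}
```

In other words, the application code stays the same; what typically changes on EMR are the connection settings and the timeout values.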