AWS EMR cluster - scale up didn't update dfs.replication value from 1 to 2 - hadoop

I provisioned an AWS EMR HBase cluster with 1 master and 1 core node (m5.xlarge). My cluster doesn't have any task nodes, as I plan to use this cluster only for storage. The hdfs-site.xml file on both boxes had dfs.replication set to 1, which makes sense. I then manually added 5 more core nodes. I was hoping EMR would bump the replication factor from 1 to 2 as per their docs - https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hdfs-config.html
As I understand it, EMR would have set the replication factor to 2 if I had provisioned 6 core nodes at cluster creation, but what about my use case, where I manually scaled the cluster up after it was already up and running?

Looks like EMR won't do it automatically. After scaling the cluster up, I need to update the replication factor myself by reconfiguring the instance group - https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps-running-cluster.html
instanceGroups.json below:
[
  {
    "InstanceGroupId": "<ig-1xxxxxxx9>",
    "Configurations": [
      {
        "Classification": "yarn-site",
        "Properties": {
          "yarn.nodemanager.disk-health-checker.enable": "true",
          "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage": "100.0"
        },
        "Configurations": []
      }
    ]
  }
]
aws emr modify-instance-groups --cluster-id <j-2AL4XXXXXX5T9> \
  --instance-groups file://instanceGroups.json
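Note that the JSON above is the generic example from the AWS docs and targets yarn-site; for the replication factor itself, the same mechanism should work with the hdfs-site classification. A minimal sketch (the instance group ID is a placeholder for your core group):

[
  {
    "InstanceGroupId": "<your-core-instance-group-id>",
    "Configurations": [
      {
        "Classification": "hdfs-site",
        "Properties": {
          "dfs.replication": "2"
        },
        "Configurations": []
      }
    ]
  }
]

Keep in mind that dfs.replication is only a default applied when a file is written; blocks that already exist keep their old replication factor unless you change them explicitly, e.g. with hdfs dfs -setrep.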

Related

How to configure 3 new instances as dedicated master nodes in a running cluster that already has combined master and data nodes (Elasticsearch)?

Context:
We have an Elasticsearch cluster with 10 nodes that are all configured as master: true and data: true.
Due to the characteristics of our infrastructure, all the nodes of a cluster (a cluster of virtual machines in this case) take their configuration from a GitHub repository; in other words, each and every node has the same configuration.
Of these 10 nodes (all configured as master: true, data: true), 3 are configured as master-eligible.
Steps we performed:
We turned off the 10 nodes that were being used by the cluster (all master: true and data: true).
We changed the configuration of the old nodes (let's call that group of virtual machines elastic-data) to data: true, and of the new nodes (let's call that new group elastic-master) to master: true.
We listed the new masters as master-eligible in both configurations (elastic-master and elastic-data nodes).
We restarted the app.
Problem we found:
The cluster started normally. Queries for cluster administration tasks (listing nodes, looking up configuration, etc.) were very fast; with the previous configuration they often did not respond at all. But when we tried to run a query against the data we got: cannot allocate because all found copies of the shard are either stale or corrupt.
After hours of trying to recover from that state, we decided to roll back the configuration, but the cluster remains unstable.
A step-by-step guide on how to do this without leaving the cluster in the described state would be highly appreciated.
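For reference, the role settings being described look roughly like this in elasticsearch.yml (pre-7.x syntax, since the post uses master: true / data: true; this only sketches the intended end state, not a safe migration procedure):

# elastic-master group: dedicated master-eligible nodes, no data
node.master: true
node.data: false

# elastic-data group: data-only nodes
node.master: false
node.data: true

# both groups (pre-7.x): quorum of dedicated masters, (3 / 2) + 1 = 2
discovery.zen.minimum_master_nodes: 2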

Yellow status of a 'small' Elasticsearch cluster vs. green status of a 'big' cluster while uploading data

I have a script for uploading data to Elasticsearch, and it works fine with ES clusters containing 3 instances. But running the script against a 2-instance cluster throws that cluster into yellow status. Deleting the index restores it to green.
Found this: "A yellow cluster status means that the primary shards for all indices are allocated to nodes in a cluster, but the replica shards for at least one index are not."
How could I fix that? Should I improve my script with some kind of cluster-size switch?
Your index settings almost certainly require 2 replicas. Since a replica and its primary shard cannot be allocated on the same node, your cluster can't allocate all of its shards on a 2-node cluster.
Could you try decreasing the number of replicas to 1?
See here for the docs:
PUT /<your_index>/_settings
{
  "index": {
    "number_of_replicas": 1
  }
}
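If it helps, you can check the current replica count and shard allocation before and after the change (assuming a node reachable on localhost:9200):

curl 'http://localhost:9200/<your_index>/_settings?pretty'
curl 'http://localhost:9200/_cat/shards/<your_index>?v'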
Keep us posted!

Hadoop: how to correctly decommission nodes on cluster?

I've been trying to change the number of nodes in my Hadoop cluster (5 nodes in total: 1 master and 4 workers) by following this solution, change number of data nodes in Hadoop, and this useful post, Commissioning and Decommissioning of Datanode in Hadoop Cluster.
Now I can see that on HDFS I have successfully decommissioned one node:
HDFS screenshot
I have set up an exclude file property in my hdfs-site.xml (hdfs-site.xml screenshot) as well as in yarn-site.xml (yarn-site.xml screenshot); the excludes file contains the IP address of the node that I want to decommission, like so: excludes file screenshot
I have also run the refresh-nodes commands.
Finally, I ran hadoop dfsadmin -report and I can see the node is actually decommissioned: nodes report screenshot
However, in the MapReduce cluster metrics I found that there are still 5 active nodes, and the node that has been decommissioned on HDFS is not identified among the decommissioned nodes there.
see: hadoop cluster metrics
Why is that?
The issue was solved when I changed the host name in the exclude file. The node name should not contain any port number.
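For anyone hitting the same issue, a minimal sketch of the pieces involved (standard Hadoop property names; the excludes path is a placeholder):

<!-- hdfs-site.xml -->
<property>
  <name>dfs.hosts.exclude</name>
  <value>/path/to/excludes</value>
</property>

<!-- yarn-site.xml -->
<property>
  <name>yarn.resourcemanager.nodes.exclude-path</name>
  <value>/path/to/excludes</value>
</property>

After editing the excludes file (host names only, no port numbers), refresh both HDFS and YARN:

hdfs dfsadmin -refreshNodes
yarn rmadmin -refreshNodes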

Elasticsearch: Starting Multiple Clusters

I started two Elasticsearch clusters with different names, but the other one won't show up either in Marvel or when querying for health manually.
curl 'http://127.0.0.1:9200/_cat/health?v'
epoch timestamp cluster status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1501062768 15:22:48 Cove_dev_cluster yellow 1 1 8 8 0 0 8 0 - 50.0%
But it's running on my screen.
I am assuming you are running both clusters (single nodes each, I believe, in this case) on the same machine. In this case the nodes have a default port range setting of 9200-9300, and they are configured to bind to the first available port in that range. More details are available in the Network Settings documentation.
So in your case the other cluster is most likely running on port 9201. If you check Marvel or query the health manually on port 9201, you should find the other cluster.
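For example, assuming default settings on the same host:

curl 'http://127.0.0.1:9201/_cat/health?v'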
However, if you want to have two nodes participating in the same cluster, then make sure that the cluster name matches in the configuration of both Elasticsearch instances you have running.
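For instance, set the same name in elasticsearch.yml on both nodes, something like:

cluster.name: Cove_dev_cluster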
Hope this helps.

Elasticsearch: possible to change the number of replicas after the system is running?

Elasticsearch 1.7.2 on CentOS
3-node cluster
This question is about how to manage the ES config via modifications to elasticsearch.yml plus a restart of the elasticsearch service (not via the API).
Out of the box, the config is:
index.number_of_replicas: 1
So on a 3-node cluster, any 2 nodes together hold a complete copy of the data.
If I want any single node to hold a complete copy, I would set:
index.number_of_replicas: 2
a) Correct?
b) Can I just walk up to an existing setup and make this change?
c) And can I adjust it up to 2 and back down to 1 whenever I want? (Up to make each node a possible standalone copy, down to save disk space.)
The number of replicas can be changed at any point in time; you can increase or decrease it dynamically. There is a good example shown here.
Also please note that you can't change the number of shards after index creation, but the number of replicas is open to change via the index settings API.
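For reference (since the linked example isn't reproduced here), the dynamic change via the index settings API looks something like this on a 1.x cluster:

curl -XPUT 'http://localhost:9200/<your_index>/_settings' -d '
{
  "index": {
    "number_of_replicas": 2
  }
}'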
FWIW, another way to do this (which I have now proven out) is to update the yml file (elasticsearch.yml). Change the element:
index.number_of_replicas: 2
Up or down, as desired, and restart the elasticsearch service
service elasticsearch restart
The cluster will go yellow while the replicas are being created or moved, and then go green.

Resources