How to monitor etcd cluster status, like ZooKeeper's four-letter command 'srvr' - etcd

I want to monitor the throughput (req/sec) and latency of each endpoint in the etcd cluster. I know that ZooKeeper's four-letter command 'srvr' can do this; are there any such tools for etcd?
Thanks for any kind of help!
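etcd has no four-letter commands; instead each member serves Prometheus-format metrics over HTTP on its client port, including per-RPC request counters and latency histograms (exact coverage varies by etcd version), and recent etcdctl offers `endpoint status` for a quick per-endpoint summary. A minimal sketch, assuming a member on the default 127.0.0.1:2379; the saved sample below stands in for live output so the parsing step is reproducible without a cluster:

```shell
# Against a live cluster (assumption: member listening on 127.0.0.1:2379):
#   curl -s http://127.0.0.1:2379/metrics | grep grpc_server_handled_total
#   etcdctl endpoint status --write-out=table
# Parsing a saved sample here so the sketch runs stand-alone:
cat > /tmp/etcd_metrics.sample <<'EOF'
grpc_server_handled_total{grpc_code="OK",grpc_method="Range",grpc_service="etcdserverpb.KV",grpc_type="unary"} 340
grpc_server_handled_total{grpc_code="OK",grpc_method="Put",grpc_service="etcdserverpb.KV",grpc_type="unary"} 120
EOF
# Print per-method cumulative request counts.
awk '/^grpc_server_handled_total/ {
  match($0, /grpc_method="[^"]+"/)
  method = substr($0, RSTART + 13, RLENGTH - 14)
  printf "%s %s\n", method, $NF
}' /tmp/etcd_metrics.sample
```

The counters are cumulative, so to get req/sec you sample twice and divide the delta by the interval; Prometheus's `rate()` does exactly this, which is why most etcd deployments scrape /metrics with Prometheus rather than polling by hand.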

Related

Control over scheduling/placement in Apache Storm

I am running a wordcount topology in a Storm cluster composed of 2 nodes. One node is the master node (with Nimbus, UI and Logviewer) and both of them are Supervisors with 1 Worker each. In other words, my master node is also a Supervisor, and the second node is only a Supervisor. As I said, there is 1 Worker per Supervisor.
The topology I am using is configured so that it is using these 2 Workers (setNumWorkers(2)). In details, the topology has 1 Spout with 2 threads, 1 split Bolt and 1 count Bolt. When I deploy the topology with the default scheduler, the first Supervisor has 1 Spout thread and the split Bolt, and the second Supervisor has 1 Spout thread and the count Bolt.
Given this context, how can I control the placement of operators (Spout/Bolt) between these 2 Workers? For research purpose, I need to have some control over the placement of these operators between nodes. However, the mechanism seems to be transparent within Storm and such a control is not available for the end-user.
I hope my question is clear enough. Feel free to ask for additional details. I am aware that I may need to dig into Storm's source code and recompile. That's fine. I am looking for a starting point and advice on how to proceed.
The version of Storm I am using is 2.1.0.
Scheduling is handled by a pluggable scheduler in Storm. See the documentation at http://storm.apache.org/releases/2.1.0/Storm-Scheduler.html.
You may want to look at the DefaultScheduler for reference https://github.com/apache/storm/blob/v2.1.0/storm-server/src/main/java/org/apache/storm/scheduler/DefaultScheduler.java. This is the default scheduler used by Storm, and has a bit of handling for banning "bad" workers from the assignment, but otherwise largely just does round robin assignment.
If you don't want to implement a cluster-wide scheduler, you might be able to set your cluster to use the ResourceAwareScheduler, and use a topology-level scheduling strategy instead. You would set this by setting config.setTopologyStrategy(YourStrategyHere.class) when you submit your topology. You will want to implement this interface https://github.com/apache/storm/blob/e909b3d604367e7c47c3bbf3ec8e7f6b672ff778/storm-server/src/main/java/org/apache/storm/scheduler/resource/strategies/scheduling/IStrategy.java#L43 and you can find an example implementation at https://github.com/apache/storm/blob/c427119f24bc0b14f81706ab4ad03404aa85aede/storm-server/src/main/java/org/apache/storm/scheduler/resource/strategies/scheduling/DefaultResourceAwareStrategy.java
Edit: If you implement either an IStrategy or IScheduler, they need to go in a jar that you put in storm/lib on the Nimbus machine. The strategy or scheduler needs to be on the classpath of the Nimbus process.
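Deploying a cluster-wide scheduler can then be sketched as follows. The class name and jar are placeholders, and a scratch directory stands in for the real Storm install so the commands run as-is; `storm.scheduler` is the storm.yaml key that names the IScheduler implementation Nimbus should load:

```shell
# Placeholder path: substitute your real $STORM_HOME on the Nimbus machine.
STORM_HOME=/tmp/storm-home-sketch
mkdir -p "$STORM_HOME/lib" "$STORM_HOME/conf"

# 1) The jar containing your IScheduler/IStrategy implementation must end up
#    on the Nimbus classpath, i.e. dropped into storm/lib:
touch my-custom-scheduler.jar                      # stand-in for your built jar
cp my-custom-scheduler.jar "$STORM_HOME/lib/"

# 2) For a cluster-wide scheduler, point Nimbus at the class in conf/storm.yaml
#    (topology-level strategies are set via config.setTopologyStrategy instead):
echo 'storm.scheduler: "com.example.MyCustomScheduler"' >> "$STORM_HOME/conf/storm.yaml"
```

After editing storm.yaml you restart Nimbus so it picks up the new scheduler; a topology-level IStrategy needs no storm.yaml change, only the jar in storm/lib.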

Too many fetch failures

I have a 2-node Hadoop cluster on Ubuntu 12.04 with Hadoop 1.2.1.
While trying to run the Hadoop word count example I am getting a "Too many fetch failures" error. I have referred to many articles but I am unable to figure out what the entries in the masters, slaves and /etc/hosts files should be.
My node names are "master" with IP 10.0.0.1 and "slaveone" with IP 10.0.0.2.
What should the entries in the masters, slaves and /etc/hosts files be on both the master and slave nodes?
If you're unable to upgrade the cluster for whatever reason, you can try the following:
Ensure that your hostname is bound to the network IP and NOT 127.0.0.1 in /etc/hosts
Ensure that you're using only hostnames and not IPs to reference services.
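A sketch of the three files asked about, for this 2-node setup (master = 10.0.0.1, slaveone = 10.0.0.2). Everything is written under /tmp so it can be tried safely; the real locations are /etc/hosts on both nodes and $HADOOP_HOME/conf/{masters,slaves} on the master:

```shell
# /etc/hosts on BOTH nodes: hostnames bound to the network IPs, not 127.0.0.1.
cat > /tmp/hosts <<'EOF'
127.0.0.1   localhost
10.0.0.1    master
10.0.0.2    slaveone
EOF

# conf/masters lists where the SecondaryNameNode runs (Hadoop 1.x);
# conf/slaves lists the DataNode/TaskTracker hosts.
printf 'master\n' > /tmp/masters
printf 'master\nslaveone\n' > /tmp/slaves
```

Note that the master appears in the slaves file only because it also runs a DataNode/TaskTracker in this 2-node setup; on a larger cluster it usually would not.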
If the above are correct, try the following settings:
set mapred.reduce.slowstart.completed.maps=0.80
set tasktracker.http.threads=80
set mapred.reduce.parallel.copies=10 (any value >= 10 should probably be sufficient)
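In Hadoop 1.x these settings live in conf/mapred-site.xml; a sketch with the values above (written under /tmp for illustration, the real file is $HADOOP_HOME/conf/mapred-site.xml):

```shell
cat > /tmp/mapred-site.xml <<'EOF'
<configuration>
  <property>
    <name>mapred.reduce.slowstart.completed.maps</name>
    <value>0.80</value>
  </property>
  <property>
    <name>tasktracker.http.threads</name>
    <value>80</value>
  </property>
  <property>
    <name>mapred.reduce.parallel.copies</name>
    <value>10</value>
  </property>
</configuration>
EOF
# Quick sanity check: three properties defined.
grep -c '<property>' /tmp/mapred-site.xml
```

The TaskTracker-side setting (tasktracker.http.threads) requires a TaskTracker restart to take effect; the other two can also be set per-job.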
Also checkout this SO post: Why I am getting "Too many fetch-failures" every other day
And this one: Too many fetch failures: Hadoop on cluster (x2)
And also this if the above don't help: http://grokbase.com/t/hadoop/common-user/098k7y5t4n/how-to-deal-with-too-many-fetch-failures
For brevity, and in the interest of time, I'm putting what I found to be the most pertinent here.
The number 1 cause of this is something that causes a connection to get a map output to fail. I have seen:
1) firewall
2) misconfigured ip addresses (ie: the task tracker attempting the fetch received an incorrect ip address when it looked up the name of the tasktracker with the map segment)
3) rare, the http server on the serving tasktracker is overloaded due to insufficient threads or listen backlog; this can happen if the number of fetches per reduce is large and the number of reduces or the number of maps is very large.
There are probably other cases; this recently happened to me when I had 6000 maps and 20 reducers on a 10 node cluster, which I believe was case 3 above. Since I didn't actually need to reduce (I got my summary data via counters in the map phase) I never re-tuned the cluster.
EDIT: Original answer said "Ensure that your hostname is bound to the network IP and 127.0.0.1 in /etc/hosts"

why hadoop lost nodes

I am confused: when I run the command "hadoop dfsadmin -report" I can see my DataNodes reported as live, but the ResourceManager's cluster metrics show nodes as lost.
Why is that, and how could it happen?
Thanks in advance!
You're connected to 9 slave nodes, but only 5 of them are in the active state; the remaining ones are unhealthy.
Reason for the unhealthy state:
Hadoop MapReduce provides a mechanism by which administrators can configure the TaskTracker to run an administrator-supplied script periodically to determine if a node is healthy or not. Administrators can determine if the node is in a healthy state by performing any checks of their choice in the script. If the script detects the node to be in an unhealthy state, it must print a line to standard output beginning with the string ERROR. The TaskTracker spawns the script periodically and checks its output. If the script's output contains the string ERROR, as described above, the node's status is reported as 'unhealthy' and the node is black-listed on the JobTracker. No further tasks will be assigned to this node.
However, the TaskTracker continues to run the script, so that if the node becomes healthy again, it will be removed from the blacklisted nodes on the JobTracker automatically. The node's health, along with the output of the script if it is unhealthy, is available to the administrator in the JobTracker's web interface. The time since the node was healthy is also displayed on the web interface.
Reason for lost nodes:
Some blocks (data) may not be available on the slaves, so it shows 9 lost nodes.
To remove dead nodes from the cluster, use this link: To Decommission Nodes
The cluster metrics in the ResourceManager show the status of the NodeManagers, while the hadoop dfsadmin -report command shows the status of the DataNodes.

How to specify DataNode servers on a Hadoop cluster

I am running a Hadoop cluster on 4 servers, and I see that all servers have a TaskTracker and a DataNode.
I start the cluster with hadoop/bin/start-all.sh.
Two of the servers have very little disk space, so I want them to run only a TaskTracker.
How should I configure Hadoop?
hadoop/bin/start-all.sh actually just calls hadoop/bin/start-dfs.sh followed by hadoop/bin/start-mapred.sh so this provides a convenient way to use different settings for the two sets of daemons. The easiest way is to create a separate file, perhaps called hadoop/conf/datanodes, and then populate it with just the 2 servers that you want to be datanodes; presumably, you also have hadoop/conf/slaves which lists all 4 servers.
echo "my-datanode0" > hadoop/conf/datanodes
echo "my-datanode1" >> hadoop/conf/datanodes
Then, run the two commands separately. Note that there is no semicolon after the environment-variable assignment, as you need the variable to propagate into the underlying "slaves.sh" call:
HADOOP_SLAVES=hadoop/conf/datanodes ./hadoop/bin/start-dfs.sh
./hadoop/bin/start-mapred.sh
Go ahead and check port 50030 for your JobTracker's listing of TaskTrackers, then 50070 for your NameNode's listing of DataNodes, and you should be good to go.

Aerospike's behavior on EC2

In my test setup on EC2 I have done the following:
One Aerospike server is running in ZoneA (say Aerospike-A).
Another node of the same cluster is running in ZoneB (say Aerospike-B).
The application using above cluster is running in ZoneA.
I am initializing the AerospikeClient like this:
Host[] hosts = new Host[1];
hosts[0] = new Host(PUBLIC_IP_OF_AEROSPIKE_A, 3000);
AerospikeClient client = new AerospikeClient(policy, hosts);
With the above setup I am seeing the following behavior:
Writes are happening on both Aerospike-A and Aerospike-B.
Reads are only happening on Aerospike-A (the data is around 1 million records, occupying 900 MB of memory and 1.3 GB of disk).
Question: Why are reads not going to both the nodes?
If I take Aerospike-B down, everything works perfectly; there is no outage.
If I take Aerospike-A down, all writes and reads start failing. I waited 5 minutes for the other node to take over the traffic, but it didn't.
Questions:
a. In the above scenario, I would expect Aerospike-B to take all the traffic, but this is not happening. Is there anything I am doing wrong?
b. Should I be giving both hosts when initializing the client?
c. I had executed "clinfo -v 'config-set:context=service;paxos-recovery-policy=auto-dun-all'" on both nodes. Is that creating a problem?
In EC2 you should place all the nodes of the cluster in the same AZ of the same region. You can use the rack awareness feature to set up nodes in two separate AZs, but you will pay a latency hit on each of your writes.
Now, what you're seeing is likely due to misconfiguration. Each EC2 machine has a public IP and a local IP. Machines on the same subnet may reach each other through the local IP, but a machine in a different AZ cannot. You need to make sure your access-address is set to the public IP when your cluster nodes are spread across two availability zones. Otherwise you get clients which can reach only some of the nodes, lots of proxy events as the cluster nodes try to compensate and move your reads and writes to the correct node for you, and weird issues with data when nodes leave or join the cluster.
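A sketch of the relevant stanza in /etc/aerospike/aerospike.conf on each node (54.x.x.x is a placeholder for that node's own EC2 public IP; written under /tmp so the sketch is runnable as-is):

```shell
cat > /tmp/aerospike-network.conf <<'EOF'
network {
    service {
        address any
        port 3000
        access-address 54.x.x.x   # the public IP clients and remote nodes should use
    }
}
EOF
```

Each node announces its access-address to clients and to the rest of the cluster, so every node needs its own public IP here, and the service must be restarted after the change.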
For more details:
http://www.aerospike.com/docs/operations/configure/network/general/
https://github.com/aerospike/aerospike-client-python/issues/56#issuecomment-117901128
