At the moment, I have a Cellar setup with just two nodes (meant for testing), as seen in the dump below:
  | Id                 | Alias          | Host Name    | Port
--+--------------------+----------------+--------------+-----
x | 192.168.99.1:5702  | localhost:8182 | 192.168.99.1 | 5702
  | 192.168.99.1:5701  | localhost:8181 | 192.168.99.1 | 5701
Edit 1 -- Additional Information about the setup (begin):
I have multiple Cellar nodes. I am trying to make one node the master, which is supposed to expose a management web panel through which I would like to fetch stats from all the other nodes. For this purpose, I have exposed custom MBean implementations containing my business logic. I understand that these MBeans can be invoked using Jolokia, and I am already doing that. That means all of the nodes will have Jolokia installed, while the master node will also have Hawtio installed (so that I can connect to the slave nodes via the Jolokia API through the Hawtio panel).
Right now, I am manually assigning the alias for every node (which refers to the web endpoint it exposes via its pax-web configuration). This is just a workaround to simplify my testing procedure.
Desired Process:
I have access to the ClusterManager service via the service registry, so I am able to invoke clusterManager.listNodes() and loop through the result in my MBean. While looping through it, all I get is the basic node info. If it is possible, I would like to parse the etc/org.ops4j.pax.web.cfg file of every node and get the port number (i.e. the value of the property org.osgi.service.http.port).
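For example, what I want from every node is effectively the result of the local lookup below, but fetched remotely (e.g. over Jolokia, which every node already runs). The Karaf config MBean and credentials in the second command are only my guess and depend on the Karaf version and instance name; it also already requires knowing the web port, which is exactly the chicken-and-egg problem I am trying to solve:
# locally, the value I am after is simply:
grep org.osgi.service.http.port etc/org.ops4j.pax.web.cfg
# e.g. -> org.osgi.service.http.port = 8181
# one (untested) idea for fetching it remotely through the node's Jolokia endpoint:
curl -s -u karaf:karaf "http://192.168.99.1:8181/jolokia/exec/org.apache.karaf:type=config,name=root/listProperties/org.ops4j.pax.web"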
While retrieving the list of nodes, I would like to get a response as:
{
  "Node 1": {
    "hostname": "192.168.0.100",
    "port": 5701,
    "webPort": "8181",
    "alias": "Data-Node-A",
    "id": "192.168.0.100:5701"
  },
  "Node 2": {
    "hostname": "192.168.0.100",
    "port": 5702,
    "webPort": "8182",
    "alias": "Data-Node-B",
    "id": "192.168.0.100:5702"
  }
}
Edit 1 (end):
I am trying to find a way to execute specific commands on a particular node. For example, I want to execute a command on Node *:5702 from *:5701 such that *:5702 returns the properties and values of a local configuration file.
My current method is not optimal, as I am setting the alias (the web endpoint for Jolokia) of each node manually, and based on that I am retrieving the desired info via my custom MBean. I suspect this is not best practice.
So far, I have:
Set<Node> nodes = clusterManager.listNodes();
Thus, if I loop through this set of nodes, I would like to retrieve settings from the local configuration file of every node, based on the node ID.
Do I need to implement something specific to DOSGi here?
Or would it be something similar to the ping-pong sample code (https://github.com/apache/karaf-cellar/tree/master/utils/src/main/java/org/apache/karaf/cellar/utils/ping) from the Apache Cellar project?
Any input on this would be very helpful.
P.S. I tried posting this on the Karaf mailing list, but my posts are getting bounced.
Regards,
Cooshal.
I'm trying to modify the Prometheus Mesos exporter to expose framework states:
https://github.com/mesos/mesos_exporter/pull/97/files
A bit about the Mesos exporter: it collects data from both the Mesos /metrics/snapshot endpoint and the /state endpoint.
The issue with the latter, both with the changes in my PR and with the existing metrics reported on slaves, is that once a metric is created it lasts forever (until the exporter is restarted).
So if, for example, a framework has completed, the metrics reported for it will be stale (e.g. they will still show the framework using CPU).
So I'm trying to figure out how I can clear those stale metrics. If I could just clear the entire mesosStateCollector each time before collect runs, it would be awesome.
There is a delete method for the different p8s vectors (e.g. GaugeVec), but in order to delete a metric I need not only the label name but also the label value of the relevant metric.
OK, so it seems it was easier than I thought (if only I had been familiar with Go before approaching this task).
I just need to cast the collector to GaugeVec and reset it:
prometheus.NewGaugeVec(prometheus.GaugeOpts{
    Help:      "Total slave CPUs (fractional)",
    Namespace: "mesos",
    Subsystem: "slave",
    Name:      "cpus",
}, labels): func(st *state, c prometheus.Collector) {
    c.(*prometheus.GaugeVec).Reset() // <-- added this for each GaugeVec
    for _, s := range st.Slaves {
        c.(*prometheus.GaugeVec).WithLabelValues(s.PID).Set(s.Total.CPUs)
    }
},
I can't find where to get the number of currently open shards.
I want to set up monitoring to avoid cases like this:
this cluster currently has [999]/[1000] maximum shards open
I can get the maximum limit, max_shards_per_node:
$ curl -X GET "${ELK_HOST}/_cluster/settings?include_defaults=true&flat_settings=true&pretty" 2>/dev/null | grep cluster.max_shards_per_node
"cluster.max_shards_per_node" : "1000",
$
But I can't find out how to get the number of currently open shards (999).
A very simple way to get this information is to call the _cat/shards API and count the number of lines using the wc shell command:
curl -s -XGET ${ELK_HOST}/_cat/shards | wc -l
That will yield a single number that represents the number of shards in your cluster.
Another option is to retrieve the shard list in JSON format, pipe the results into jq and then grab whatever you want, e.g. below I'm counting all STARTED shards:
curl -s -XGET "${ELK_HOST}/_cat/shards?format=json" | jq ".[].state" | grep "STARTED" | wc -l
Yet another option is to query the _cluster/stats API:
curl -s -XGET ${ELK_HOST}/_cluster/stats?filter_path=indices.shards.total
That will return a JSON response with the shard count:
{
  "indices" : {
    "shards" : {
      "total" : 302
    }
  }
}
To my knowledge, there is no API from which ES spits out this single number directly. To be sure of that, let's look at the source code.
The error is thrown from IndicesService.java.
To see how currentOpenShards is computed, we can then go to Metadata.java.
As you can see, the code iterates over the index metadata retrieved from the cluster state, pretty much like running the following command and counting the number of shards, but only for indices whose "state" is "open":
GET _cluster/state?filter_path=metadata.indices.*.settings.index.number_of*,metadata.indices.*.state
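If you want to reduce that output to a single number yourself, a rough equivalent with jq (summing primaries times replicas-plus-one over the open indices returned by the command above) could look like this:
curl -s "${ELK_HOST}/_cluster/state?filter_path=metadata.indices.*.settings.index.number_of*,metadata.indices.*.state" | jq '[.metadata.indices[] | select(.state == "open") | (.settings.index.number_of_shards | tonumber) * ((.settings.index.number_of_replicas | tonumber) + 1)] | add'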
From that evidence, we can be pretty sure that the single number you're looking for is nowhere to be found, but needs to be computed by one of the methods shown above. You're free to open a feature request if needed.
The problem: it seems that your Elastic cluster is hitting its shards-per-node limit.
Solution:
Verify the number of shards per node in your configuration and increase it using the Elastic API.
For getting the number of shards, use the _cluster/stats API:
curl -s -XGET 'localhost:9200/_cluster/stats?filter_path=indices.shards.total'
From elastic docs:
The Cluster Stats API allows to retrieve statistics from a cluster wide perspective. The API returns basic index metrics (shard numbers, store size, memory usage) and information about the current nodes that form the cluster (number, roles, os, jvm versions, memory usage, cpu and installed plugins).
For updating the number of shards (increasing/decreasing), use the _cluster/settings API:
For example:
curl -XPUT -H 'Content-Type: application/json' 'localhost:9200/_cluster/settings' -d '{ "persistent" : {"cluster.max_shards_per_node" : 5000}}'
From elastic docs:
With specifications in the request body, this API call can update cluster settings. Updates to settings can be persistent, meaning they apply across restarts, or transient, where they don't survive a full cluster restart.
You can reset persistent or transient settings by assigning a null value. If a transient setting is reset, the first one of these values that is defined is applied:
1. the persistent setting
2. the setting in the configuration file
3. the default value
The order of precedence for cluster settings is:
1. transient cluster settings
2. persistent cluster settings
3. settings in the elasticsearch.yml configuration file
It's best to set all cluster-wide settings with the settings API and use the elasticsearch.yml file only for local configurations. This way you can be sure that the setting is the same on all nodes. If, on the other hand, you define different settings on different nodes by accident using the configuration file, it is very difficult to notice these discrepancies.
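For example, to later reset the override from above back to its default, you assign it a null value:
curl -XPUT -H 'Content-Type: application/json' 'localhost:9200/_cluster/settings' -d '{ "persistent" : {"cluster.max_shards_per_node" : null}}'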
Another one-liner, using the _cat/indices output (default columns: health, status, index, uuid, pri, rep, ...) to sum primaries times (replicas + 1) over all open indices:
curl -s '127.1:9200/_cat/indices' | awk '{ if ($2 == "open") C+=$5*($6+1)} END {print C}'
This works:
GET /_stats?level=shards&filter_path=_shards.total
Reference:
https://stackoverflow.com/a/38108448/4271117
I'm using WAS ND and want to have a dmgr profile with a federated managed profile for the app.
I am creating the cluster using:
AdminTask.createCluster('[-clusterConfig [-clusterName %s -preferLocal true]]' % nameOfModulesCluster)
Next, I'm configuring my WAS instance: queues, datasources, JDBC, JMS activation specs, factories, etc.
Just before creating the cluster member, I'm displaying:
print("QUEUES: \n" + AdminTask.listSIBJMSQueues(AdminConfig.getid('/ServerCluster:ModulesCluster/')))
print("JMS AS: \n" + AdminTask.listSIBJMSActivationSpecs(AdminConfig.getid('/ServerCluster:ModulesCluster/')))
And it returns all the queues I've created earlier. But when I call
AdminTask.createClusterMember('[-clusterName %(cluster)s -memberConfig [-memberNode %(node)s -memberName %(server)s -memberWeight 2 -genUniquePorts true -replicatorEntry false] -firstMember [-templateName default -nodeGroup DefaultNodeGroup -coreGroup DefaultCoreGroup -resourcesScope cluster]]' % {'cluster': nameOfCluster, 'node': nameOfNode, 'server': nameOfServer})
AdminConfig.save()
the configuration displayed earlier is... gone. Some configuration (like datasources) is still displayed in the ibm/console, but the queues and JMS activation specs are not. The same print now displays nothing, but the member is added to the cluster.
I can't find any information using Google. I've tried AdminNodeManagement.syncActiveNodes(), but it won't work since I'm running
/opt/IBM/WebSphere/AppServer/bin/wsadmin.sh -lang jython -conntype NONE -f global.py
and AdminControl is not available.
What should I do in order to keep my configuration created before clustering? Do I have to sync it somehow?
This is the default behavior and is due to the -resourcesScope attribute in the createClusterMember command. This attribute determines how the server resources are promoted in the cluster while adding the first cluster member.
Valid options for resourcesScope are:
Cluster: moves the resources of the first cluster member to the cluster level. The resources of the first cluster member replace the resources of the cluster. (This is the default option.)
Server: maintains the server resources at the new cluster member level. The cluster resources remain unchanged.
Both: copies the resources of the cluster member (server) to the cluster level. The resources of the first cluster member replace the resources of the cluster. The same resources exist at both the cluster and cluster member scopes.
Since you have set "-resourcesScope cluster" in your createClusterMember command, all configuration created at the cluster scope is being deleted/replaced by the empty configuration of the new cluster member.
So, for your configuration to be kept, set "-resourcesScope server", so that the cluster configuration is not replaced by the cluster member configuration:
AdminTask.createClusterMember('[-clusterName %(cluster)s -memberConfig [-memberNode %(node)s -memberName %(server)s -memberWeight 2 -genUniquePorts true -replicatorEntry false] -firstMember [-templateName default -nodeGroup DefaultNodeGroup -coreGroup DefaultCoreGroup -resourcesScope server]]' % {'cluster': nameOfCluster, 'node': nameOfNode, 'server': nameOfServer})
AdminConfig.save()
Refer "Select how the server resources are promoted in the cluster" section in https://www.ibm.com/support/knowledgecenter/en/SSAW57_8.5.5/com.ibm.websphere.nd.doc/ae/urun_rwlm_cluster_create2_v61.html for more details.
We use Terraform to create and destroy a Mesos DC/OS cluster on AWS EC2. The number of agent nodes is defined in a variable.tf file:
variable "instance_counts" {
type = "map"
default = {
master = 1
public_agent = 2
agent = 5
}
}
Once the cluster is up, you can add or remove agent nodes by changing the number of agents in that file and applying again. Terraform is smart enough to recognize the difference and act accordingly. When it destroys nodes, it tends to go for the highest-numbered ones. For example, if I have an 8-node DC/OS cluster and want to terminate 2 of the agents, Terraform will take down dcos_agent_node-6 and dcos_agent_node-7.
What if I want to destroy an agent with a particular IP? Terraform must be aware of the IPs because it knows the order of the instances. How do I hack Terraform to remove agents by providing the IPs?
I think you're misunderstanding how Terraform works.
Terraform takes your configuration and builds a dependency graph of how to create the resources described in that configuration. If it has a state file, it then overlays information from the provider (such as AWS) to see what is already created and managed by Terraform, removes that from the plan, and potentially creates destroy plans for resources that exist in the provider and the state file.
So if you have a configuration with a 6 node cluster and a fresh field (no state file, nothing built by Terraform in AWS), then Terraform will create 6 nodes. If you then set it to 8 nodes, Terraform will attempt to build a plan containing 8 nodes, realise it already has 6 and then create a plan to add the 2 missing nodes. When you then change your configuration back to 6 nodes, Terraform will build a plan with 6 nodes, realise you have 8 and create a destroy plan for nodes 7 and 8.
Getting it to do anything different from that would involve some horrible hacking of the state file so that it thinks nodes 7 and 8 are different from the ones most recently added by Terraform.
As an example your state file might look something like this:
{
  "version": 3,
  "terraform_version": "0.8.1",
  "serial": 1,
  "lineage": "7b565ca6-689a-4aab-a3ec-a1ed77e83678",
  "modules": [
    {
      "path": [
        "root"
      ],
      "outputs": {},
      "resources": {
        "aws_instance.test.0": {
          "type": "aws_instance",
          "depends_on": [],
          "primary": {
            "id": "i-01ee444f57aa32b8e",
            "attributes": {
              ...
            },
            "meta": {
              "schema_version": "1"
            },
            "tainted": false
          },
          "deposed": [],
          "provider": ""
        },
        "aws_instance.test.1": {
          "type": "aws_instance",
          "depends_on": [],
          "primary": {
            "id": "i-07c1999f1109a9ce2",
            "attributes": {
              ...
            },
            "meta": {
              "schema_version": "1"
            },
            "tainted": false
          },
          "deposed": [],
          "provider": ""
        }
      },
      "depends_on": []
    }
  ]
}
If I wanted to go back to a single instance instead of 2, Terraform would attempt to remove the i-07c1999f1109a9ce2 instance, as the configuration is telling it that aws_instance.test.0 should exist but not aws_instance.test.1. To get it to remove i-01ee444f57aa32b8e instead, I could edit my state file to flip the two around, and then Terraform would think that that instance should be removed instead.
However, you're getting into very difficult territory as soon as you start doing things like that and hacking the state file. While it's something you can do (and occasionally may need to), you should seriously consider how you are working if this is anything other than a one-off case for a special reason (such as moving raw resources into modules, now made easier with Terraform's state mv command).
In your case I'd question why you need to remove two specific nodes in a Mesos cluster rather than just specifying the size of the Mesos cluster. If it's a case of a specific node being bad then I'd always terminate it and allow Terraform to build me a fresh, healthy one anyway.
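If you really did want Terraform itself to replace one particular instance, a rough sketch would be to taint it and re-apply. The resource name aws_instance.agent below is just a placeholder for whatever your configuration calls the agent instances, and note that pre-0.12 versions address counted instances as name.N while newer ones use name[N]:
terraform taint aws_instance.agent.3
terraform plan
terraform apply
This recreates that one instance in place rather than shrinking the cluster, which is usually what you want for a bad node.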
I have created an ES cluster with ES running on three different machines. In order to make them form a cluster, I have added the unicast config below in the elasticsearch.yml config file on all 3 machines:
discovery.zen.ping.unicast.hosts: [IP1, IP2, IP3]
When I run
curl -XGET localhost:9200/_cluster/health?pretty
I am getting number_of_nodes as 3. Now I want to remove one node from the cluster,
so without changing any config file I ran the command below:
curl -XPUT localhost:9200/_cluster/settings -d '{
  "transient" : {
    "cluster.routing.allocation.exclude._ip" : "IP_address_of_Node3"
  }
}';
After this I ran the second command again to get the cluster details. The expected output is that number_of_nodes should be 2, but the result still shows 3 nodes, even after excluding the node. It would be of great help if someone could tell me what is wrong with the steps I followed for removing the node.
Thanks
The cluster.routing.allocation.exclude._ip setting that you sent to your cluster will not actually remove the node from your cluster, but rather prepare it for removal. What it does is instruct Elasticsearch to move all shards held on this node away from it and store them on other nodes instead.
This allows you to then remove the node once it is empty, without causing under-replication of the shards stored on it.
To actually remove the node from your cluster you would need to remove it from your list of unicast hosts. Of course, you can also just shut it down and leave it in the list until you next need to restart your cluster anyway; as far as I am aware, that won't hurt anything.
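A rough way to follow that process from the command line (using the same localhost:9200 convention as in your question) is to watch the shards drain off the excluded node, and once it holds none and has been shut down, clear the exclusion again by setting it to null:
curl -s -XGET 'localhost:9200/_cat/allocation?v'
curl -XPUT localhost:9200/_cluster/settings -d '{
  "transient" : {
    "cluster.routing.allocation.exclude._ip" : null
  }
}'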