Elasticsearch - Adding node without replication?

I have a master/data Elasticsearch node. It has now reached 90% capacity and I need to provision additional space to continue adding more data.
I have created a new server with 700 GB of disk space, installed ES and Kibana, and now want this second server to provide additional space to, and work with, the master node.
My problem:
As it says on the ES website:
"When you add more nodes to a cluster, it automatically allocates replica shards."
My issue is that I do not wish to replicate the data from the master node, but instead just provide additional space using this second server which can then be queried by the master node.
My question:
What is the best way to achieve this? Is adding a node the incorrect thing to do here?

Using index-level shard allocation filtering, you can constrain a given index (or set of indexes) to stay on a given node (or set of nodes).
Simply run this:
PUT orders,orders_1,orders_2,orders_3,orders_4,orders_5/_settings
{
  "index.routing.allocation.require._name": "your-first-node-name"
}
Note that you can also use ._ip or ._host instead of ._name if you prefer.
Then you can add a new node and let it join the cluster: nothing will rebalance, and all your current shards will stay on your current node.
And if you need to create a new index on the second node and want to make sure that it stays on that node, you can specify the same setting at index creation time:
PUT new_orders
{
  "settings": {
    "index.routing.allocation.require._name": "your-second-node-name"
  }
}
The index called new_orders will be created on the second node and stay there.
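For anyone scripting this instead of using the console, here is a minimal Python sketch that builds the same settings request with the standard library only. The function name build_allocation_request and the base URL http://localhost:9200 are assumptions for illustration; the setting key and the example index and node names are the ones used above.

```python
import json
import urllib.request


def build_allocation_request(indices, node_name, base_url="http://localhost:9200"):
    """Build (but do not send) a PUT <indices>/_settings request that
    requires the given indices to live on the node called node_name."""
    url = f"{base_url}/{','.join(indices)}/_settings"
    body = {"index.routing.allocation.require._name": node_name}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


if __name__ == "__main__":
    req = build_allocation_request(["orders", "orders_1"], "your-first-node-name")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To apply it, send the request to a cluster assumed reachable at base_url:
    # urllib.request.urlopen(req)
```

Building the request separately from sending it makes it easy to inspect the URL and body before touching a live cluster.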

Related

Can you run an elasticsearch data node after deleting the data folder?

I am running a three-node Elasticsearch (ELK) cluster. All nodes have the same roles (data, master, etc.). The disk on node 3 that holds the data folder became corrupt, and that data is probably unrecoverable. The other nodes are running normally and one of them took over the master role.
Will the cluster work normally if I replace the disk and make the empty directory available to elastic again, or am I risking crashing the whole cluster?
EDIT: As this is not explicitly mentioned in the answer, yes, if you add your node with an empty data folder, the cluster will continue normally as if you added a new node to the cluster, but you have to deal with the missing data. In my case, I lost the data as I do not have replicas.
Let me try to explain that in a simple way.
The data on node-3 got corrupted, so if you add that node again, it will not have the old data, i.e. the shards that were stored on node-3 will remain unavailable to the cluster.
Did you have replica shards configured for the indexes?
What is the current status (yellow/red) of the cluster with node-3 removed?
If a primary shard isn't available, the master node promotes one of the active replicas to become the new primary. If there are no active replicas, the status of the cluster will remain red.
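To answer the status question above, you can look at the response of GET _cluster/health. The following is a small Python sketch, not an official client, that interprets an already parsed health response; the function name explain_health is made up for illustration, while status and unassigned_shards are fields of the health response.

```python
def explain_health(health):
    """Summarise a parsed `GET _cluster/health` response (a dict)."""
    status = health.get("status")
    unassigned = health.get("unassigned_shards", 0)
    meanings = {
        "green": "all primary and replica shards are assigned",
        "yellow": "all primaries are assigned, but some replicas are not",
        "red": "at least one primary shard is unassigned, so some data is unavailable",
    }
    meaning = meanings.get(status, "unknown status")
    return f"{status}: {meaning} ({unassigned} unassigned shards)"


if __name__ == "__main__":
    # Hypothetical response for a cluster that lost a node holding
    # primaries without replicas.
    print(explain_health({"status": "red", "unassigned_shards": 4}))
```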

How to know total nodes in an elasticsearch cluster?

I have a 3-node Elasticsearch cluster. If more than one node goes down, I can easily check them manually. But if the number of nodes in the cluster grows, checking them manually becomes difficult. So how can I get all the nodes (specifically their names) of the cluster, even the ones that are down?
To get live/healthy nodes I hit the api endpoint:
curl -X GET "hostname/ip:port/_cat/nodes?v&pretty"
Is there any endpoint I can use to get the total nodes and the unhealthy/down nodes in an Elasticsearch cluster?
I was trying to list all the nodes using discovery.seed_hosts from the elasticsearch.yml config file, but I don't know how to do it or whether it is the right approach.
I don't think there is any API that reports offline nodes. Whether the entire cluster or a single node is down, Elasticsearch does not provide a way to check the health of a node that is not running. You need to depend on an external script or monitoring tool that pings all your nodes and prints their status.
You can write a custom script that calls the API below, which returns all the nodes currently available in the cluster. Once you have the response, you can extract the IP or hostname of each node; any node that is missing from the response can be considered down.
GET _cat/nodes?format=json&filter_path=ip,name
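A rough Python sketch of such a script, using only the standard library, could look like this. It assumes you keep your own list of expected node names (the names and the base URL below are hypothetical), and it selects columns with the cat API's h parameter rather than filter_path.

```python
import json
import urllib.request


def fetch_live_node_names(base_url):
    """Return the names of the nodes currently in the cluster."""
    url = f"{base_url}/_cat/nodes?format=json&h=ip,name"
    with urllib.request.urlopen(url, timeout=5) as resp:
        return {node["name"] for node in json.load(resp)}


def find_down_nodes(expected_names, live_names):
    """Nodes we expect to exist but that the cluster did not report."""
    return sorted(set(expected_names) - set(live_names))


if __name__ == "__main__":
    expected = ["node-1", "node-2", "node-3"]  # hypothetical inventory
    live = fetch_live_node_names("http://localhost:9200")  # assumed address
    print("down:", find_down_nodes(expected, live))
```

If no node answers at all, urlopen raises an exception, which the caller can treat as "the whole cluster is unreachable".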
Another option is to enable cluster monitoring, which gives you the status of the entire cluster, but again it only shows information about running nodes.
Please check this answer for how Kibana shows offline nodes in Cluster Monitoring.

elastic search preference setting Custom Value(Java api)

I really need some help with Elasticsearch usage in the Java API.
Let's assume I am using the Java API from ES.
So far, I understand that Elasticsearch can give inconsistent results because primaries and replicas can be out of sync (deleting a doc changes the overall stats, since the doc is only marked as deleted instead of being removed).
So what I tried is
searchRequest.preference("_primary_first")
This gave me consistent results (since it only uses primary shards!)
Now what I want to try in my toy example is:
1) using preference=Custom (string) value
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-body.html#request-body-search-preference
2) if I have 5 nodes, I want to designate which node to use based on the queryText.
For instance,
'''
if (queryRequest.text().equals("red")) {
    // use 1st node
    searchRequest.preference("??????");
} else if (queryRequest.text().equals("blue")) {
    // use 2nd node
    searchRequest.preference("??????");
} else {
    // use any of the 3rd~5th nodes <- not necessary if it is really hard
    searchRequest.preference("???????");
}
'''
Q1)
I guess I need to use the custom value "WISELY" to denote which node to use...
Can someone give me a simple Java API example?
Q2)
This is a separate question, but is there any way to load the status of each node from the searchResponse? (Again, in a Java API friendly way.)
Q3)
Is there any clever way to use the 1st node (or a certain node ID?) for a given query text, instead of using a hashmap?
For instance,
let's say I don't know which query text I will receive, but I want to distribute the queries evenly across the nodes (5 of them!)
while sticking with the first choice.
If the very first query text I see is "red" and I assign this queryRequest to Node1, then later I also want to use Node1 when I see the query text "red" again. Does anyone have an idea?
Thank you guys!
Disclaimer:
I am a non-CS guy and an independent learner who is trying new things to break out of my comfort zone! :) Please excuse this silly question!
Actually it's not a silly question and the answer has two parts.
You mention nodes and you want to control which node gets what queries based on an attribute.
Some context:
An elasticsearch cluster has elasticsearch nodes
Your documents will be "saved" in an elasticsearch index and the queries you perform will be against that index
An elasticsearch index is but an abstraction, a layer that hides the complexity of shards (basically lucene indices).
Now when you save a document, that document will eventually be stored in a shard (there are segments etc, but no reason to go any further). Now you can have primary shards and replica shards. When you save something, that will go to a primary shard and will be replicated by elasticsearch to the replica shards (if any). Your searches can and will be served both by primary and replica shards.
Now, you want to control which node gets what. What you can control is which shard gets what via routing on save and via routing on search.
You've asked to control which node gets what. Most of the time you won't need this. What you can control is which shard gets what, so you then need to control which node holds which shard. This can be accomplished via shard allocation awareness.
Both of these topics are advanced features, and you need to know what you are doing when using them or you'll get very unexpected results.
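On Q3 specifically, one way to get "same query text, same choice" without keeping a hashmap is to derive the custom preference string from the text with a stable hash. The sketch below is in Python for brevity and only illustrates the idea; preference_for and the bucket-N naming are made up for this example. Note that a custom preference value (any string not starting with _) makes Elasticsearch route searches with the same value to the same shard copies; it does not pin a search to a particular node by itself.

```python
import hashlib


def preference_for(query_text, buckets=5):
    """Map a query text to one of `buckets` stable preference strings.

    The same text always yields the same string, and different texts
    spread roughly evenly over the buckets.
    """
    digest = hashlib.sha256(query_text.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % buckets
    return f"bucket-{bucket}"


if __name__ == "__main__":
    for text in ["red", "blue", "red"]:
        # In the Java API the result would be passed to
        # searchRequest.preference(...).
        print(text, "->", preference_for(text))
```

A cryptographic hash is used here only because it is stable across processes and runs; Python's built-in hash() is randomized per process for strings, so it would not give the "stick with the first choice" behaviour.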

how to disable shard re-balancing in elastic search, while allowing new indices to be allocated?

I am using Elasticsearch version 1.0.1 and want to achieve two things at the same time:
1. Allow new indices to be created (the primary and replica shards need to be allocated as per the usual logic).
2. Prevent existing shards from being rebalanced on node failure.
What combination of settings will allow me to achieve this? I tried the settings from the cluster module documented at http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-cluster.html. But I am unable to achieve both at the same time.
Thanks,

Remove of data folder is not synced in Elasticsearch upon index delete

We have an ES cluster with 2 nodes. When we delete an index, not all of its folders on the filesystem are deleted, which causes problems when one server is restarted.
The deleted indices then get redistributed in some weird state, and the cluster health is no longer green.
Example: we delete the index named someIndex, and after the deletion the filesystem looks like this:
Node1
ElasticSearch\data\clustername\nodes\0\indices\
ElasticSearch\data\clustername\nodes\1\indices\
Node2
ElasticSearch\data\clustername\nodes\0\indices\
ElasticSearch\data\clustername\nodes\1\indices\someIndex (<-- still present)
Does anyone know what's causing this?
ES-version: 0.90.5
There are two node directories on each machine's filesystem (nodes\0 and nodes\1).
When you start Elasticsearch, you start up a node (in ES lingo). Your machine can host multiple nodes, which happens if you start Elasticsearch multiple times. The default setting for the HTTP port is the range 9200-9300, which means ES looks for a free port in that range and binds its node to it (the same is true for the transport module with 9300-9400).
So, if you start an ES process while another one is still running, that is, still bound to a port, you start a second node and ES creates a new directory for it. Maybe this happened when you issued a restart, but ES couldn't shut down in time before the new node started up.
But now you have a third node in your cluster, and ES will assign shards to it. Then you do a cluster restart or something similar and start one node on each of your machines. ES cannot find the shards that were assigned to the third node, because it has not been started, and it will show a red or yellow state, depending on which shards live on the third node. If you delete your index data, you won't delete the data from this missing node.
If you don't care about the data, you can just shut down ES and delete these directories, or start two ES nodes on each of your machines and then delete the index again.
Then you could change the port settings to one specific port. That would prevent a second process from starting up, since it would not be able to bind to a free port.
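The port behaviour described above can be illustrated with a few lines of Python. This is only a sketch of "bind the first free port in a range", not Elasticsearch's actual code; the function name and the port numbers in the demo are chosen for illustration.

```python
import socket


def bind_first_free_port(start, end, host="127.0.0.1"):
    """Try ports start..end in order and return (socket, port) for the
    first one that can be bound, or (None, None) if all are taken."""
    for port in range(start, end + 1):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))
        except OSError:
            sock.close()  # port already in use, try the next one
            continue
        return sock, port
    return None, None


if __name__ == "__main__":
    first, p1 = bind_first_free_port(39200, 39250)
    # With a range, a second "process" silently gets the next free port.
    second, p2 = bind_first_free_port(p1, 39250)
    # With a single fixed port, the second attempt fails instead.
    third, p3 = bind_first_free_port(p1, p1)
    print(p1, p2, p3)
    for s in (first, second):
        if s is not None:
            s.close()
```

With a range, the second caller ends up on a different port, which mirrors how an accidental second node appears; with a single port, the second bind fails and no extra node is started.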
