Clustering in Elasticsearch - cluster-computing

I have set up clustering with Elasticsearch, and the ElasticHead UI displays the detected nodes.
However, I am not sure how it works. Could anyone please provide a link or some direction that explains how clustering works in Elasticsearch?

Elasticsearch uses multicast to check whether other nodes with the same cluster name are present on the network (this was the default zen discovery behaviour in early versions; later releases rely on a configured list of hosts instead).
If there is such a node, it connects to it.
It then shares its data with that node according to the shard configuration.
http://www.elasticsearch.org/guide/reference/modules/discovery/zen.html
Read the above to get the full picture.

Related

How to know total nodes in an elasticsearch cluster?

I have a 3-node Elasticsearch cluster. If more than one node goes down, I can easily check them manually. But as the number of nodes in the cluster grows, checking them manually will become difficult. So how can I get all the nodes of the cluster (specifically their names), even if they are down?
To get live/healthy nodes I hit this API endpoint:
curl -X GET "hostname/ip:port/_cat/nodes?v&pretty"
Is there an endpoint that gives me the total nodes and the unhealthy/down nodes in an Elasticsearch cluster?
I was trying to list all the nodes using discovery.seed_hosts from the elasticsearch.yml config file, but I don't know how to do it or whether it is the right approach.
I don't think there is any API that reports offline nodes. Whether your entire cluster is down or a single node is down, Elasticsearch doesn't provide a way to check the health of a node that is not running. You need to rely on an external script, code, or monitoring tool that pings all your nodes and prints their status.
You can write a custom script that calls the API below, which returns all the nodes currently available in the cluster. Once you have the response, you can extract the IP or hostname of each node; any node that is missing from the response can be considered down.
GET _cat/nodes?format=json&filter_path=ip,name
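A minimal sketch of such a script in Python, assuming you keep your own list of expected node names (the names and the base URL used here are hypothetical). It selects columns with the cat API's `h` parameter, and it only compares names; it does not ping hosts directly.

```python
import json
import urllib.request


def fetch_live_node_names(base_url, timeout=5.0):
    """Ask one reachable node which nodes are currently in the cluster."""
    url = base_url.rstrip("/") + "/_cat/nodes?format=json&h=ip,name"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        rows = json.load(resp)  # a list of {"ip": ..., "name": ...} objects
    return {row["name"] for row in rows}


def find_down_nodes(expected_names, live_names):
    """Return expected nodes that the cluster did not report, sorted by name."""
    return sorted(set(expected_names) - set(live_names))


# Hypothetical inventory and a hand-written "live" set, to show the comparison
# without needing a running cluster.
expected = ["es-node-1", "es-node-2", "es-node-3"]
live = {"es-node-1", "es-node-3"}
print(find_down_nodes(expected, live))
```

Against a real cluster you would pass the result of `fetch_live_node_names("http://localhost:9200")` as `live`. If that call itself fails, no node at that address answered, so the script should try another address from the inventory.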
Another option is to enable cluster monitoring, which gives you the status of the entire cluster, but again it shows information about running nodes only.
Please check this answer for how Kibana shows offline nodes in Cluster Monitoring.

Which Elasticsearch node should I query from my application

If I were to set up my Elastic cluster with 3 master nodes and 5 to 10 data nodes, which node IP address should I actually use in my application to query Elasticsearch? I am following the hot-warm architecture, and my understanding is that the master node should always be responsible for handling an incoming request, coordinating it across the other nodes in the cluster, and assembling the final response.
So should I only use master node IP addresses in my application to talk to the cluster?
First of all, you shouldn't use an individual IP to connect to a cluster, as that node becomes a single point of failure if it goes down. You should have a load-balanced URL that connects to data nodes or coordinating nodes to serve your searches.
Also, it looks like you have dedicated master nodes. For larger clusters it's typically not recommended to use a master as the search coordinator; masters should ideally be kept in the master-eligible-only role to ensure cluster stability. That leaves you with the option of using either data nodes or coordinating-only nodes to accept your search requests.
If you are using clients like JEST, NEST, etc. and not calling the HTTP endpoint for _search directly, you also have the option of providing a list of IPs/hostnames to form a connection pool.
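To illustrate what a connection pool does, here is a rough Python sketch that rotates through a list of hosts and falls back to the next one when a node is unreachable. The class name and host URLs are hypothetical, and the official clients implement this far more thoroughly (dead-node tracking, retries, sniffing).

```python
import itertools
import json
import urllib.error
import urllib.request


class RoundRobinPool:
    """Rotate search requests over several data or coordinating nodes."""

    def __init__(self, hosts):
        if not hosts:
            raise ValueError("at least one host is required")
        self._hosts = list(hosts)
        self._cycle = itertools.cycle(self._hosts)

    def next_host(self):
        return next(self._cycle)

    def search(self, index, body, timeout=5.0):
        last_error = None
        # Try each host at most once per call.
        for _ in range(len(self._hosts)):
            request = urllib.request.Request(
                f"{self.next_host()}/{index}/_search",
                data=json.dumps(body).encode("utf-8"),
                headers={"Content-Type": "application/json"},
                method="POST",
            )
            try:
                with urllib.request.urlopen(request, timeout=timeout) as resp:
                    return json.load(resp)
            except urllib.error.HTTPError:
                raise  # the node answered; a bad query will not improve elsewhere
            except OSError as exc:
                last_error = exc  # unreachable node, move on to the next host
        raise ConnectionError("no node in the pool was reachable") from last_error


pool = RoundRobinPool(["http://data-1:9200", "http://data-2:9200"])
print([pool.next_host() for _ in range(3)])
```

Note that the host list contains no dedicated master nodes, in line with the advice above.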
Like @askids mentioned, always connect to Elasticsearch using a standard client. Elastic itself provides clients.
https://www.elastic.co/guide/en/elasticsearch/client/index.html
You have not mentioned which client you are going to use. If your client is Java-based, use Elasticsearch's Low Level or High Level REST Client. These clients are wrappers around the Apache HTTP client and provide all the boilerplate logic for handling connections, among other features.
You can also add Sniffer support to them.
https://www.elastic.co/guide/en/elasticsearch/client/java-rest/current/java-rest-low.html
https://www.elastic.co/guide/en/elasticsearch/client/java-rest/current/sniffer.html

Is there a way to instruct Elasticsearch to only return matches from one node

We are designing a large framework around Elasticsearch and are investigating a few options.
For some complex analysis jobs, we are looking for a way to retrieve data from only the currently connected Elasticsearch node, i.e. only data from the primary shards on the node that I am connected to via the client, or no result if no primary shard is located on that node.
Is this possible via some search attribute or via a more specialized setup?
We want to use the normal Elasticsearch functionality as much as possible, naturally, but sometimes there might be queries that need this type of access. Is this doable with Elasticsearch?
You can restrict the search to specific shards using the preference query string parameter (see https://www.elastic.co/guide/en/elasticsearch/reference/1.7/search-request-preference.html).
For example, sending your query to http://ES-NODE:9200/INDEXNAME/_search?preference=_shards:1 should restrict the query to shard 1.
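As a small sketch, the same URL can be built in Python like this. The helper names are made up for illustration; only the `preference` parameter and the `_shards:` value come from the documentation linked above. Note that this selects shards by number, so you still need to know which shards live on which node.

```python
from urllib.parse import quote, urlencode


def shards_preference(shard_numbers):
    """Build a preference value such as '_shards:0,2' from shard numbers."""
    return "_shards:" + ",".join(str(n) for n in shard_numbers)


def search_url(base_url, index, preference):
    """Return the _search URL for an index with the preference parameter set."""
    query = urlencode({"preference": preference})  # percent-encodes ':' and ','
    return f"{base_url.rstrip('/')}/{quote(index)}/_search?{query}"


print(search_url("http://ES-NODE:9200", "INDEXNAME", shards_preference([1])))
```

The colon is percent-encoded as `%3A` in the output, which is the same value once the server decodes the query string.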

Elasticsearch query a specific node for scroll

I have a scan/scroll query where each document that comes back has something done to it, and the changes are then written back. Basically I am mapping over the whole index (or document type, actually).
If the function applied during this mapping becomes too slow, I need to find a way to split the work across several machines.
I could share a scroll ID across multiple machines using ZooKeeper or something similar, but will there be issues querying ES from 2 clients at almost the same time?
Alternatively, is there a way to write a query that will only run against one specified node? That way, if I had one 'mapping process' on the same box as each node, I could remove the network overhead.
Check the "_only_node" and "_prefer_node" values of the preference parameter in the Elasticsearch search API.
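A rough Python sketch of how a per-box worker might use this, assuming it first asks its local node for its ID via `GET /_nodes/_local` and then opens a scroll pinned to that node. The helper names, index name, and node ID are hypothetical, and the exact preference spelling depends on the Elasticsearch version (`_only_node:<id>` in 1.x), so check the docs for the release you run.

```python
from urllib.parse import urlencode


def local_node_id(nodes_local_response):
    """Extract the single node ID from a parsed GET /_nodes/_local response."""
    node_ids = list(nodes_local_response["nodes"])
    if len(node_ids) != 1:
        raise ValueError(f"expected exactly one node, got {len(node_ids)}")
    return node_ids[0]


def node_scroll_url(base_url, index, node_id, keep_alive="1m"):
    """URL for the initial scroll request, restricted to one node's shards."""
    query = urlencode({
        "scroll": keep_alive,
        "preference": f"_only_node:{node_id}",
    })
    return f"{base_url.rstrip('/')}/{index}/_search?{query}"


# Hand-written sample shaped like a /_nodes/_local response.
sample = {"nodes": {"abc123": {"name": "node-on-this-box"}}}
print(node_scroll_url("http://localhost:9200", "docs", local_node_id(sample)))
```

Each worker then scrolls only the shards held by its own node, so no scroll ID has to be shared between machines. Whether replica copies lead to the same document being visited by two workers depends on the shard layout and needs checking.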

Use Elasticsearch as backup store

My application receives and parses thousands of small JSON snippets, each about 1 KB, every hour. I want to create a backup of all incoming JSON snippets.
Is it a good idea to use Elasticsearch to back up these snippets in an index with, for example, "number_of_replicas": 4? I have never read of anyone using Elasticsearch for this.
Is my data safe in Elasticsearch when I use a cluster of servers and replicas, or should I use other storage for this use case?
(Writing to the local file system isn't safe, as our hard disks crash often. I first thought about using HDFS, but it isn't made for small files.)
First you need to understand the difference between replicas and backups.
A replica is an additional copy of the data at run time. It increases availability and provides failover support, but it won't protect against accidental deletion of data.
A backup is a copy of the whole data set taken at backup time. It is used to restore the data after the system has crashed.
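To make the replica side concrete, here is a small Python sketch of the index settings from the question and a back-of-the-envelope estimate of how many node losses the copies can absorb. It assumes each copy of a shard sits on a different node (Elasticsearch does not allocate two copies of the same shard to one node) and ignores master election; the function names are made up for illustration.

```python
import json


def index_settings(number_of_shards, number_of_replicas):
    """Request body for creating an index with the given shard layout."""
    return {
        "settings": {
            "index": {
                "number_of_shards": number_of_shards,
                "number_of_replicas": number_of_replicas,
            }
        }
    }


def node_losses_tolerated(number_of_replicas, node_count):
    """How many nodes can fail before some shard has no copy left."""
    # A shard has 1 primary + N replicas, but at most one copy per node.
    allocated_copies = min(number_of_replicas + 1, node_count)
    return max(allocated_copies - 1, 0)


print(json.dumps(index_settings(5, 4)))
print(node_losses_tolerated(4, 5))  # 4 replicas on 5 nodes
print(node_losses_tolerated(4, 3))  # 4 replicas, but only 3 nodes to hold them
```

None of this covers accidental deletes or bad writes, which are replicated to every copy just like good data.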
Using Elasticsearch as a backup is not a good idea. Elasticsearch is a search engine, not a database. If you have not configured the ES cluster carefully, you will end up losing data.
So, in my opinion:
To store JSON objects there are plenty of databases. For example, MongoDB is a NoSQL database that can easily be configured with more replicas, which gives high availability of data and failover support. As you asked, it is also open source and more reliable.
For more info about MongoDB, refer to https://www.mongodb.org/
Update:
In Elasticsearch, if you create an index with more shards, they are distributed among the nodes. If a node fails and its shards have no replica on another node, that data is lost. In MongoDB, by contrast, each node in a replica set holds its own copy of the data, so if one MongoDB node fails we can retrieve our data from the other replicas. We need to be more careful about replica setup and shard allocation in Elasticsearch, whereas in MongoDB it is easier and the architecture is good too.
Note: I didn't say that storing data in Elasticsearch is not safe. I mean that, compared to MongoDB, replicas are more difficult to configure and maintain in Elasticsearch.
Hope it helps!

Resources