The question is more around: "How do Elasticsearch nodes interact to give a specific search result and what is the flow of a search request?"
I've read the following links, but they don't clearly answer what I am trying to understand.
https://www.elastic.co/guide/en/elasticsearch/reference/master/ingest.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-node.html
As per the above documentation,
"Data Nodes" are the ones which perform all the processing when an _search query is invoked.
"Ingest nodes" do some pre-processing before indexing the data.
So, are the two statements above correct?
Accordingly,
Do Ingest nodes have any role to perform when an _search query happens?
Do Data Nodes have any role to perform when data is being indexed?
Do any other nodes have any role to perform when data is being searched?
Or if you could help explain the flow of a search request (which node receives the API call, which node filters the data, which node runs the aggregations, etc.), then that would be really helpful.
In case it is relevant, I am on Elasticsearch 7.5.
Do Ingest nodes have any role to perform when an _search query happens?
If it's a dedicated ingest node, then no; if it also holds data (primary or replica shards), then yes.
Do Data Nodes have any role to perform when data is being indexed?
Yes, data nodes actually hold the data (primary and replica shards), and ultimately they are responsible for indexing and searching this data.
Do any other nodes have any role to perform when data is being searched?
Yes; please refer to the responsibilities of the coordinating role in ES.
In short, ingest nodes only transform the data, data nodes actually hold the data, and each role can be dedicated to a node or shared with other roles on the same node.
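To make the role split concrete, here is a minimal sketch. It assumes the boolean settings node.master, node.data and node.ingest used by 7.5 (all true by default); the Node class and the duty strings are made up for illustration and are not part of any Elasticsearch client.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Node:
    """Hypothetical model of one node; the fields mirror the settings
    node.master, node.data and node.ingest (all true by default)."""
    name: str
    master: bool = True
    data: bool = True
    ingest: bool = True

    def duties(self) -> List[str]:
        # Every node can accept client requests and coordinate them.
        found = ["coordinate requests"]
        if self.data:
            found.append("hold shards: index and search locally")
        if self.ingest:
            found.append("run ingest pipelines before indexing")
        if self.master:
            found.append("eligible to be elected master")
        return found


dedicated_ingest = Node("ingest-1", master=False, data=False, ingest=True)
default_node = Node("node-1")

print(dedicated_ingest.duties())
print(default_node.duties())
```

A dedicated ingest node never lists the shard-holding duty, so it has nothing to search locally; a node with the default settings does everything.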
Below are the steps in a search request:
The coordinating node receives the request. This can be a dedicated coordinating node or, by default, a data node doing this work.
The coordinating node forwards the request to the data nodes that hold the shards (primary or replica) needed for your search request.
The data nodes run the search locally and send their results back to the coordinating node.
The coordinating node merges the results from all nodes into the top 10 hits (10 is the default) and sends back the response.
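The steps above can be sketched as a toy simulation. Everything here is hypothetical (node names, shard layout, scores), and picking a shard copy with random.choice is a simplification of whatever selection logic the real cluster applies.

```python
import heapq
import random

# Hypothetical cluster: shard id -> nodes holding a copy (primary or replica).
SHARD_COPIES = {0: ["data-1", "data-2"], 1: ["data-2", "data-3"]}

# Hypothetical per-shard contents: (score, doc_id) pairs for some query.
SHARD_HITS = {
    0: [(2.1, "a"), (0.7, "b"), (1.5, "c")],
    1: [(1.9, "d"), (0.2, "e"), (2.4, "f")],
}


def local_search(shard, size):
    """Step 3: a data node searches one shard and returns its local top hits."""
    return heapq.nlargest(size, SHARD_HITS[shard])


def coordinate(size=10, rng=random):
    """Steps 1, 2 and 4: receive the request, fan out, merge the answers."""
    contacted = {}
    candidates = []
    for shard, nodes in SHARD_COPIES.items():
        node = rng.choice(nodes)  # one copy per shard is enough
        contacted[shard] = node
        candidates.extend(local_search(shard, size))
    return contacted, heapq.nlargest(size, candidates)


contacted, top = coordinate(size=3)
print(contacted)  # which node served each shard
print(top)        # merged global top hits
```

Which node serves each shard can change between runs, but the merged top hits stay the same because every copy of a shard holds the same documents.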
I have a 2-node Elasticsearch cluster with IP addresses of xx.xx.xx.17 (master) and xx.xx.xx.18 (data). I know this is the documented way of searching on a preferred replica/node.
The question is: if I send my request to the xx.xx.xx.18 (data) node (for example, http://xx.xx.xx.18:9200/product/_count), will the request query that specific node?
Or is the only way of querying a preferred node to send the request with the 'preference' parameter, as in the above link?
When you send a query to an Elasticsearch node, it will talk to any and all other nodes that hold data for the indices that need to be queried. If you have replicas assigned to indices, Elasticsearch will randomly pick between the primary and (n) replica shards.
Assuming each of your nodes holds a full copy of every shard, either primary or replica, this means you might get your response from all shards on that node or not, which is what LeBigCat hints at above.
However, you can use preference here, yes. But it's not clear what problem you are trying to solve with this.
I am going through the documentation to better understand the role of the coordinating node and the different phases of a search request.
I came across this phrase:
Each shard returns just enough information to the coordinating node
What sort of information does the phrase "just enough information" refer to?
If we had complex queries such as bool queries or aggregations, I presume the coordinating node needs to execute the same query again to aggregate the results globally. In that case, will the coordinating node also have some kind of Lucene engine running to aggregate the results?
A coordinating node may hold data of your index (when it also acts as a data node and a shard of your index is allocated to it) or may not (when it is a dedicated coordinating node, or no shard of your index is allocated to it).
All it does is gather the results from all the other data nodes participating in the query, build a priority queue, and return the top results.
To answer your question,
I presume the coordinating node needs to execute the same query again to
aggregate the results globally. In that case, will the coordinating node
also have some kind of Lucene engine running to aggregate the
results?
No, the coordinating node does not run the query again to produce global results. It only merges the partial results it receives from the shards.
Think of it this way: you want the top 10 documents in your index, and you have 5 shards on 5 data nodes. Every shard calculates its own top 10 documents and sends them, with their scores, to the coordinating node. The coordinating node then builds a priority queue and returns the top 10 documents. For that it doesn't have to run another query; it just has to sort the 50 documents returned by the 5 data nodes, which already have scores, and return the top 10.
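Here is the same 5-shard example as a minimal sketch. The documents and scores are randomly generated placeholders, and heapq stands in for the priority queue; this is not the actual Elasticsearch implementation.

```python
import heapq
import random

NUM_SHARDS = 5
SIZE = 10

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical data: each shard holds 100 docs with made-up scores.
shards = [
    [(rng.random(), f"shard{s}-doc{d}") for d in range(100)]
    for s in range(NUM_SHARDS)
]

# Each shard computes its own top 10 and ships only (score, doc_id) pairs.
per_shard_top = [heapq.nlargest(SIZE, docs) for docs in shards]

# The coordinating node sees 5 * 10 = 50 already-scored candidates ...
candidates = [hit for top in per_shard_top for hit in top]

# ... and only has to order them; no query is executed here.
global_top = heapq.nlargest(SIZE, candidates)

# Sanity check: same answer as ranking every document in one place.
everything = [hit for docs in shards for hit in docs]
same_as_global_sort = global_top == heapq.nlargest(SIZE, everything)
print(len(candidates), same_as_global_sort)
```

The check at the end works because any document in the global top 10 must also be in the top 10 of its own shard, so nothing is lost by having each shard send only its 10 best hits.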
Good reads on this: https://discuss.elastic.co/t/how-does-elasticsearch-process-a-query/191181 and https://www.elastic.co/blog/elasticsearch-query-execution-order
If I have 3 data nodes and perform a query with a lot of aggregations, is this search distributed across all data nodes in the cluster?
Or does Elasticsearch elect one node to query and aggregate the data, acting as a load balancer rather than as a "distributed map/reduce"?
If the index you're querying contains more than one shard (whether primary or replica), then those shards will be located on different nodes, hence the query will be distributed to each node that hosts a shard of the index you're querying.
One data node will receive your request and act as the coordinating node. It will check the cluster state to figure out where the shards are located, then it will forward the request to each node hosting a shard, gather the results and send them back to the client.
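For an aggregation, the "gather the results" step can be pictured roughly like this. It is a simplified sketch with made-up node names and a made-up "color" field: each data node counts values in its local shard, and the coordinating node adds up the partial counts. Real terms aggregations are more involved (for example, shards may send only their top buckets, which can make counts approximate).

```python
from collections import Counter

# Hypothetical documents per node: one "color" field value per document.
shard_docs = {
    "data-1": ["red", "blue", "red"],
    "data-2": ["blue", "blue"],
    "data-3": ["green", "red"],
}


def shard_terms_agg(values):
    """Runs on a data node: count field values in the local shard only."""
    return Counter(values)


def coordinator_reduce(partials):
    """Runs on the coordinating node: add up the partial bucket counts."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


partials = [shard_terms_agg(values) for values in shard_docs.values()]
buckets = coordinator_reduce(partials)
print(buckets.most_common())
```

So the heavy work (scanning documents) happens in parallel on the data nodes, and the coordinating node only combines small partial results.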
I am planning to segregate Elasticsearch index and search requests as much as possible to avoid any unnecessary delay in the indexing process. There is no such thing as a dedicated Elasticsearch search node or index node. However, I was wondering if the following scenario is suitable. As far as I understand, I cannot segregate search requests from index requests completely, because in the end both hit the ES data nodes, but this is what I think can help a little:
Few Elasticsearch Coordinator nodes (No master/data) to deal with search requests and route them to the corresponding data node. Hence, for creating search client to deal with search requests, coordinator node URL will be used only.
Use Elasticsearch data nodes directly for the index path and ignore coordinator nodes for indexing.
In this case, the receiving data node will act as the coordinating node for the indexing path, and the dedicated coordinator nodes will be used to route searches to a replica on the data nodes. This minimises the unnecessary load that search routing puts on the data nodes.
I was wondering whether there is another way to provide segregation at a higher level, or whether I am insane not to use coordinator nodes for the indexing path as well.
P.S: My use case is heavy indexing and light/medium search
You can't separate indexing and search operations: indexing writes to the primary shard and then to the replica shards, whereas a search can be served by either a primary or a replica shard.
If you care about write performance:
no replica
refresh_interval > 30s; keep the analyzer simple
many shards (spread across data nodes)
send insert/update requests to data nodes directly
try to have a hot/cold data architecture (hot/cold indices)
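As a rough illustration of the first two points, this sketch builds a settings body that could be sent with PUT /<index>/_settings. The values are placeholders rather than recommendations, and the shard count is left out because it is fixed when the index is created.

```python
import json

# Settings body one could send with PUT /<index>/_settings.
# The values are illustrative, not recommendations for any specific cluster.
write_heavy_settings = {
    "index": {
        "number_of_replicas": 0,    # "no replica"
        "refresh_interval": "60s",  # refresh less often than the 1s default
    }
}

body = json.dumps(write_heavy_settings, indent=2)
print(body)
```

Running without replicas trades redundancy for write speed, so it is worth deciding whether that is acceptable for your data before applying it.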
Coordinator nodes may not improve search performance at all; this depends on your workload (aggregations, etc.).
As usual, all tuning depends on your data and usage. You must find a good balance between indexing and search performance; use the _nodes/stats endpoint to see what's going on.
We have an Elasticsearch cluster set up with two nodes. We want the second node only for replication, as the load isn't enough to warrant a second node. All primary shards are on the master.
Now here's the problem: every other query gets forwarded to the secondary node. As a result, query times are doubled. I expect this is due to Elasticsearch's load balancing.
Is there a way to prevent queries from being delegated?
If you specify preference=_local on the search request URL, the request will be executed on the node that received the request (assuming that this node has the required shards allocated on it). See http://www.elasticsearch.org/guide/reference/api/search/preference/ for more information.
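A small sketch of what such a request URL looks like. The helper search_url, the host name "es-node-1" and the index name "product" are placeholders for this example; only the preference=_local parameter comes from the answer above.

```python
from urllib.parse import urlencode, urlunsplit


def search_url(host, index, preference=None, port=9200):
    """Build a _search URL, optionally with a preference query parameter."""
    query = urlencode({"preference": preference}) if preference else ""
    return urlunsplit(("http", f"{host}:{port}", f"/{index}/_search", query, ""))


# "es-node-1" and "product" are placeholder names for this sketch.
local_url = search_url("es-node-1", "product", preference="_local")
print(local_url)
```

The function only builds the URL; sending it with your HTTP client of choice is left out so the sketch runs without a cluster.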