Coordinating node - just enough information to perform aggregation - elasticsearch

I am going through the documentation to better understand the role of the coordinating node and the different phases of a search request.
I came across this phrase:
Each shard returns just enough information to the coordinating node
What sort of information does "just enough information" refer to?
If we had complex queries such as bool queries or aggregations, I presume the coordinating node needs to execute the same query again to aggregate the results globally. In that case, will the coordinating node also have some kind of Lucene engine running to aggregate the results?

A coordinating node may hold data for your index (when it also acts as a data node and a shard of your index is allocated to it) or may not (when it is a dedicated coordinating node, or when no shard of your index lives on it).
All it does is gather the results from the other data nodes participating in the query, build a priority queue, and return the top results.
To answer your question,
I presume the coordinating node needs to execute the same query again to
aggregate the results globally. In that case, will the coordinating node
also have some kind of Lucene engine running to aggregate the
results?
No, the coordinating node does not run the query again to produce global results. It only merges the partial results that each shard has already computed.
Think of it this way: you need the top 10 documents in your index, and you have 5 shards on 5 data nodes. Every shard calculates its own top 10 documents and sends them, with their scores, to the coordinating node. The coordinating node then creates a priority queue and returns the top 10 documents. It doesn't have to run another query for that; it just sorts the 50 already-scored documents returned by the 5 data nodes and returns the top 10.
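The merge described above can be sketched in a few lines of Python. This is only an illustration, not Elasticsearch's actual implementation: the function names, the `(score, doc_id)` tuple layout, and the sample data are all made up for the example. It also shows how partial bucket counts from each shard could be summed without re-running any query.

```python
import heapq
from collections import Counter


def merge_top_hits(shard_hits, size=10):
    """Merge per-shard lists of (score, doc_id) into the global top `size`."""
    all_hits = (hit for hits in shard_hits for hit in hits)
    # nlargest keeps only `size` entries in memory, like a bounded priority queue.
    return heapq.nlargest(size, all_hits, key=lambda hit: hit[0])


def merge_bucket_counts(shard_buckets):
    """Sum partial bucket counts (hypothetical {term: count} dicts) from each shard."""
    total = Counter()
    for buckets in shard_buckets:
        total.update(buckets)  # adds counts per key
    return dict(total)


# Hypothetical data: 5 shards, each returning its local top 10 with scores.
shard_hits = [
    [(1.0 / (rank + 1) + shard * 0.01, f"shard{shard}-doc{rank}") for rank in range(10)]
    for shard in range(5)
]
top10 = merge_top_hits(shard_hits, size=10)
for score, doc_id in top10:
    print(f"{score:.2f} {doc_id}")
```

The coordinating step only touches the 50 small `(score, doc_id)` pairs, which is why no Lucene index is needed on that node for the merge itself.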
Good reads on this: https://discuss.elastic.co/t/how-does-elasticsearch-process-a-query/191181 and https://www.elastic.co/blog/elasticsearch-query-execution-order

Related

Querying a specific Elasticsearch node - do both approaches do the same thing or not?

I have a 2-node Elasticsearch cluster with IP addresses xx.xx.xx.17 (master) and xx.xx.xx.18 (data). I know this is the documented way of searching on a preferred replica/node.
The question is: if I send my request to the xx.xx.xx.18 (data) node (for example, http://xx.xx.xx.18:9200/product/_count), will the request query only that specific node?
Or is sending the request with the 'preference' parameter, as in the above link, the only way to query a preferred node?
When you send a query to an Elasticsearch node, it will talk to any and all other nodes that hold data for the indices that need to be queried. If you have replicas assigned to indices, Elasticsearch will pick between the primary and the (n) replica shards.
Assuming each of your nodes holds a full copy of every shard, either primary or replica, this means you might or might not get your response entirely from the shards on that node, which is what LeBigCat hints at above.
However, you can use preference here, yes. But it's not clear what problem you are trying to solve with this.
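As a rough illustration of the `preference` option mentioned above, here is a small Python helper that builds the request URL. The helper name `search_url` is hypothetical, and the host and index are the placeholders from the question. To my knowledge `_local` and `_only_nodes:<node-id>` are among the documented preference values, but check the documentation for your version before relying on them.

```python
from urllib.parse import urlencode


def search_url(host, index, preference=None, endpoint="_search"):
    """Build a request URL, optionally with a `preference` query parameter."""
    base = f"http://{host}:9200/{index}/{endpoint}"
    if preference is None:
        return base  # Elasticsearch chooses the shard copies itself
    return f"{base}?{urlencode({'preference': preference})}"


# Without preference: the receiving node may still fan out to other nodes.
print(search_url("xx.xx.xx.18", "product", endpoint="_count"))
# With preference: ask for shard copies on the receiving node where possible.
print(search_url("xx.xx.xx.18", "product", preference="_local", endpoint="_count"))
```

Sending the request to a particular IP address only decides which node coordinates it; the `preference` value is what influences which shard copies are searched.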

How do _search queries work in Elasticsearch?

The question is really: "How do Elasticsearch nodes interact to produce a search result, and what is the flow of a search request?"
I've referred to the following links, but they don't clearly explain what I am trying to understand.
https://www.elastic.co/guide/en/elasticsearch/reference/master/ingest.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-node.html
As per the above documentation,
"Data Nodes" are the ones which perform all the processing when an _search query is invoked.
"Ingest nodes" do some pre-processing before indexing the data.
So, are these above two statements correct?
Accordingly,
Do Ingest nodes have any role to perform when an _search query happens?
Do Data Nodes have any role to perform when data is being indexed?
Do any other nodes have any role to perform when data is being searched?
Or if you could help explain the flow of a search request (which node receives the API call, which node filters the data, which node runs the aggregations, etc.), then that would be really helpful.
In case it is relevant, I am on Elasticsearch 7.5.
Do Ingest nodes have any role to perform when an _search query happens?
If it's a dedicated ingest node, then no; if it also holds data (primary or replica shards), then yes.
Do Data Nodes have any role to perform when data is being indexed?
Yes. Data nodes actually hold the data (primary and replica shards), and ultimately they are responsible for indexing and searching it.
Do any other nodes have any role to perform when data is being searched?
Yes, please refer to the responsibilities of the coordinating role in ES.
In short, ingest nodes only transform the data and data nodes actually hold it; in ES, each role can run on a dedicated node or share a node with other roles.
Below are the steps in a search request:
The coordinating node receives the request. It can be a dedicated node, or a data node can do this work (the default).
The coordinating node forwards the request to the data nodes that hold the shards (primary or replica) relevant to your search request.
The data nodes run the search locally and send their results back to the coordinating node.
The coordinating node merges the results from all nodes into the global top hits (10 by default) and sends back the response.

How to segregate Elasticsearch index and search path as much as possible

I am planning to segregate Elasticsearch index and search requests as much as possible to avoid any unnecessary delay in the indexing process. There is no such thing as a dedicated Elasticsearch search node or index node, but I was wondering whether the following scenario is suitable. As far as I understand, I cannot segregate search requests from index requests completely, because in the end both hit the ES data nodes, but this is what I think can help a little:
A few Elasticsearch coordinator nodes (no master/data role) deal with search requests and route them to the corresponding data nodes. Hence, the search client is created with the coordinator node URLs only.
Use the Elasticsearch data nodes directly for the index path and bypass the coordinator nodes for indexing.
In this case, the receiving data node acts as the coordinator for the indexing path, and the dedicated coordinator nodes route searches to a replica on the data nodes. This minimises unnecessary load on the data nodes from search routing.
I was wondering whether there is another way to provide segregation at a higher level, or whether it is a mistake not to use coordinator nodes for the indexing path as well.
P.S: My use case is heavy indexing and light/medium search
You can't separate indexing and search operations completely: indexing writes to the primary shard and then to its replica shards, whereas a search can be served by either a primary or a replica shard.
If you care about write performance:
no replica
refresh_interval > 30s, keep analyzer simple
lot of shards (across data nodes)
send insert/update queries on data nodes directly
try to have a hot/cold data architecture (hot/cold indices)
Coordinator nodes may not improve search performance at all; it depends on your workload (aggs etc.).
As usual, all tuning depends on your data and usage. You must find a good balance between indexing and search performance; use the _nodes/stats endpoint to see what's going on.
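To make the first two tuning points concrete, here is a hedged sketch of an index settings body that could be sent with PUT to `<index>/_settings`. The helper name `write_heavy_settings` is hypothetical. The `30s` value is just the threshold mentioned above, not a recommendation for every workload, and running with zero replicas trades away redundancy.

```python
import json


def write_heavy_settings(refresh_interval="30s", replicas=0):
    """Return an index settings body favouring indexing throughput."""
    return {
        "index": {
            "refresh_interval": refresh_interval,  # refresh less often
            "number_of_replicas": replicas,        # no replica writes (less safe)
        }
    }


body = json.dumps(write_heavy_settings(), indent=2)
print(body)
```

Both settings are dynamic, so they can be relaxed during a heavy indexing window and restored afterwards.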

Does Elasticsearch evaluate relevance scores of documents in parallel for one search request

I have a native script to score documents. I'm wondering for a search request, are the documents scored by one thread (if a threadpool is used for search) or it's configurable to do that in parallel? (I know that docs on different nodes in the cluster can be scored in parallel. Here I mean in the same node).
Thanks
Documents belonging to the same shard are scored sequentially in a single thread. AFAIK, it cannot be configured to be done in parallel. Search operation across multiple shards potentially can happen in parallel whether they belong to the same node or different nodes.
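The execution model described in this answer can be pictured with a toy Python sketch: each shard is scored by a plain sequential loop, while different shards are handed to a thread pool. The scoring function (a simple term count) and all names here are invented for illustration and have nothing to do with how a native script or Lucene actually scores.

```python
from concurrent.futures import ThreadPoolExecutor


def score_shard(docs, term):
    """Score every document of one shard sequentially, in a single thread."""
    return [(doc.split().count(term), doc) for doc in docs]


def search(shards, term, workers=4):
    """Score shards in parallel, then merge the hits by descending score."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_shard = list(pool.map(lambda docs: score_shard(docs, term), shards))
    hits = [hit for shard_hits in per_shard for hit in shard_hits if hit[0] > 0]
    return sorted(hits, key=lambda hit: hit[0], reverse=True)


# Hypothetical shards, each a list of document texts.
shards = [["obama speech obama", "weather report"], ["obama visit"]]
print(search(shards, "obama"))
```

With this model, splitting an index into more shards is the lever for parallelising a single request on one node.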

Load Balancing Between Two elasticsearch servers

I have two ElasticSearch Servers:
http://12.13.54.333:9200
and
http://65.98.54.10:9200
In the first server I have 100k of data(id=1 to id=100k) and in the second server I have 100k of data(id=100k+1 to 200k).
I want to have a text search for the keyword obama in one request on both servers. Is this possible?
Your question is a little generic...I'll try not to give an "it depends" kind of answer, but in order to do so I have to make a couple of assumptions.
Are those two servers actually two nodes on the same elasticsearch cluster? I suppose so.
Did you index data on an elasticsearch index composed of more than one shard? I suppose so. The default in elasticsearch is five shards (in versions before 7.0; later versions default to one), which in your case would lead to having two shards on one node and three on the other.
Then you can just send your query to one of those nodes via the REST API. The query will be executed on all the shards that make up the index (or indices, since you can query more than one) you are querying. If you have replicas, the replica shards might be used too at query time. The node that received your query will then reduce the search results from all the shards and return the most relevant ones.
To be more specific the search phase on every shard will most likely only collect the document ids and their score. Once the node that you hit has reduced the results, it can fetch all the needed fields (usually the _source field) only for the documents that it's supposed to return.
What's nice about elasticsearch is that even if you indexed data on different indexes you can query multiple indices and everything is going to work the same as I described. At the end of the day every index is composed of shards, and querying ten indices with one shard each is the same as querying one index with ten shards.
What I described applies to the default search_type that elasticsearch uses, called query_then_fetch. There are other search types that you can use when needed, for example count, which doesn't do any reduce or fetch but just returns the number of hits for a query, executing it on all shards and returning the sum of the hits from each shard.
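The two phases described above can be simulated with a small, self-contained Python sketch. It is a toy model under stated assumptions: the `Shard` class, the term-count scoring, and the document layout are all hypothetical and only mirror the shape of the process, where the query phase returns ids and scores and the fetch phase loads the stored source for the winners only.

```python
import heapq


class Shard:
    """Toy shard: maps doc_id to a stored source document."""

    def __init__(self, docs):
        self.docs = docs

    def query(self, term, size):
        """Query phase: return the match count and the local top (score, doc_id) pairs."""
        scored = [(src["text"].split().count(term), doc_id)
                  for doc_id, src in self.docs.items()]
        matches = [hit for hit in scored if hit[0] > 0]
        return len(matches), heapq.nlargest(size, matches)

    def fetch(self, doc_id):
        """Fetch phase: load the stored source for one document."""
        return self.docs[doc_id]


def query_then_fetch(shards, term, size=10):
    total, candidates = 0, []
    for idx, shard in enumerate(shards):
        count, hits = shard.query(term, size)
        total += count  # summing per-shard counts is all a count needs
        candidates.extend((score, doc_id, idx) for score, doc_id in hits)
    winners = heapq.nlargest(size, candidates)  # reduce on the receiving node
    return total, [
        {"_id": doc_id, "_score": score, "_source": shards[idx].fetch(doc_id)}
        for score, doc_id, idx in winners
    ]


shards = [
    Shard({"1": {"text": "obama obama"}, "2": {"text": "none"}}),
    Shard({"100001": {"text": "obama"}}),
]
total, hits = query_then_fetch(shards, "obama", size=1)
print(total, hits)
```

Only the single winning document is fetched here, even though two documents matched, which is the saving the two-phase design is after.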
Revendra Kumar,
Elasticsearch should handle that for you. Elasticsearch was built from scratch to be distributed and to do distributed search.
Basically, if those servers are in the same cluster, you will have two shards (the first one holds the ids from 1 to 100k and the second one holds the ids from 100001 to 200k). When you search for something, it doesn't matter which server the request hits; it will search on both servers and return the result to the client. The internal behavior of elasticsearch is too extensive to explain here.
