Fetch phase elapsed time when searching in Elasticsearch using Spark - elasticsearch

I make queries to Elasticsearch using Spark. As the documentation says, Spark creates tasks according to the number of Elasticsearch shards (e.g., for 32 shards there will be 32 Spark tasks). Each task connects to and retrieves data from a separate Elasticsearch shard.
There is also a description of the fetch phase (from Elasticsearch: The Definitive Guide — Distributed Search Execution » Fetch Phase):
The distributed phase consists of the following steps:
1. The coordinating node identifies which documents need to be fetched and issues a multi GET request to the relevant shards.
2. Each shard loads the documents and enriches them, if required, and then returns the documents to the coordinating node.
3. Once all documents have been fetched, the coordinating node returns the results to the client.
In the Elasticsearch-Spark solution we have a different algorithm, since there is no coordinating node:
1. The shard loads the documents and enriches them, if required.
2. Elasticsearch returns the shard results to the client (the Spark task).
My question is as follows:
I look at the elapsed time of the fetch phase in the slow log. Does the elapsed time include the transfer of all the data from the shard to the client (the Spark task)? Or does it include only the time needed to retrieve the data from the filesystem?
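For reference, the search slow log is recorded at the shard level, so its timings measure shard-side execution of the query and fetch phases. A minimal sketch of enabling it, assuming a hypothetical index named my-index (the threshold values are illustrative; the setting keys are standard Elasticsearch index settings):

```python
import json

# Shard-level slow-log thresholds for the query and fetch phases.
# "my-index" and the threshold values are placeholders.
settings = {
    "index.search.slowlog.threshold.fetch.warn": "1s",
    "index.search.slowlog.threshold.fetch.info": "500ms",
    "index.search.slowlog.threshold.query.warn": "2s",
}

body = json.dumps(settings)
# The real call would be: PUT /my-index/_settings with `body` as the payload.
print(body)
```

Once enabled, fetch-phase entries above the thresholds appear in the per-shard slow log on each data node.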

Related

In a 3-node Elasticsearch cluster, is a search distributed across all nodes?

If I have 3 data nodes and perform a query with a lot of aggregations, is this search distributed across all the cluster's data nodes?
Or does Elasticsearch elect one node to query and aggregate the data, acting as a load balancer rather than as a "distributed map/reduce"?
If the index you're querying contains more than one shard (whether primary or replica), then those shards will be located on different nodes, hence the query will be distributed to each node that hosts a shard of the index you're querying.
One data node will receive your request and act as the coordinating node. It will check the cluster state to figure out where the shards are located, then it will forward the request to each node hosting a shard, gather the results and send them back to the client.
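Since any data node can act as the coordinating node, a client is free to spread search requests over the nodes itself. A minimal sketch of that idea (the node addresses and index name are placeholders):

```python
import itertools

# Any of these nodes can coordinate a search; rotate across them so no
# single node always does the coordination work. Addresses are placeholders.
nodes = ["http://node1:9200", "http://node2:9200", "http://node3:9200"]
_round_robin = itertools.cycle(nodes)

def next_search_url(index):
    """Return the URL of the next node to use as the coordinating node."""
    return f"{next(_round_robin)}/{index}/_search"

print(next_search_url("logs"))
print(next_search_url("logs"))
```

Whichever node receives the request forwards it to the shards, gathers the per-shard results, and sends the merged response back.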

setting up a basic elasticsearch cluster

I'm new to Elasticsearch and would like someone to help me clarify a few concepts.
I'm designing a small cluster with the following requirements:
- everything should still work when restarting one of the machines, one at a time (e.g. OS updates)
- a single disk failure is ok
- heavy indexing should not impact query performance
How many master, data, and ingest nodes should I have? Or do I need 2 clusters?
The indexing workload is purely indexing structured text documents, no processing/rules... do I even need an ingest node?
Also, does each node hold a complete copy of all the data, or does only the cluster as a whole hold the complete copy?
Be sure to read the documentation about Elasticsearch terminology at the very least.
With the default of 1 replica (primary shard and one replica shard) you can survive the failure of 1 Elasticsearch node (failed disk, restart, upgrade,...).
"heavy indexing should not impact query performance": You'll need to size your cluster correctly to handle both the indexing and searching. If you want to read current data and you do heavy updates, that will take up resources and you won't be able to fully decouple it.
By default every node is a data, ingest, and master-eligible node. The minimum HA setting needs 3 nodes. If you don't use ingest that's fine; it won't take up resources when you're not using it.
To understand which node has which data, you need to read up on the concept of shards. Basically, every index is broken up into 1 to N shards (the default was 5 in older versions and is 1 since Elasticsearch 7), and by default there is one primary and one replica copy of each of them.
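Because the default shard count varies by version, it is safer to set it explicitly at index-creation time. A sketch, assuming a hypothetical index named my-index (the settings keys are standard):

```python
import json

# Explicit shard/replica settings at index-creation time, rather than
# relying on the version-dependent default. "my-index" is a placeholder.
create_body = {
    "settings": {
        "number_of_shards": 5,    # primaries; cannot be changed after creation
        "number_of_replicas": 1,  # one replica copy of each primary shard
    }
}
# The real call would be: PUT /my-index with this JSON body.
print(json.dumps(create_body))
```

With these settings, losing one node leaves at least one copy of every shard available, matching the HA requirement above.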

How to segregate Elasticsearch index and search path as much as possible

I am planning to segregate Elasticsearch index and search requests as much as possible to avoid any unnecessary delay in the indexing process. There is no such thing as a dedicated Elasticsearch search node or index node. However, I was wondering if the following scenario is suitable. As far as I understand, I cannot segregate search requests from index requests completely, because in the end both hit the ES data nodes, but here is what I think can help a little:
A few Elasticsearch coordinating nodes (no master/data roles) to handle search requests and route them to the corresponding data nodes. Hence, the clients issuing search requests will use only the coordinating node URLs.
Use the Elasticsearch data nodes directly for the indexing path, bypassing the coordinating nodes for indexing.
In this case, the receiving data node will act as the coordinating node for the indexing path, while the dedicated coordinating nodes route search requests to the replicas on the data nodes. The unnecessary load on the data nodes due to search routing can thus be minimised.
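For reference, a coordinating-only node is created by disabling every other role. A sketch of the relevant elasticsearch.yml lines (this is the pre-7.9 role syntax; newer versions express the same thing as node.roles: []):

```yaml
# elasticsearch.yml for a coordinating-only node: with all roles disabled,
# the node only routes requests and merges results.
node.master: false
node.data: false
node.ingest: false
```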
I was wondering whether there is another way to provide segregation at a higher level, or whether I am insane not to use the coordinating nodes for the indexing path as well.
P.S: My use case is heavy indexing and light/medium search
You can't fully separate indexing and search operations: indexing writes to the primary shard and then to its replica shards, whereas a search can be served by either a primary or a replica shard.
If you care about write performance:
- no replicas
- refresh_interval > 30s; keep analyzers simple
- lots of shards (spread across the data nodes)
- send insert/update requests to the data nodes directly
- try to have a hot/cold data architecture (hot/cold indices)
Coordinating nodes cannot improve search performance by themselves; whether they help depends on your workload (aggregations, etc.).
As usual, all tuning depends on your data and usage; you must find the right balance between indexing and search performance. Use the _nodes/stats endpoint to see what's going on.
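The write-oriented settings listed above can be applied as dynamic index settings. A sketch, assuming a hypothetical index named my-index (the keys are standard Elasticsearch index settings):

```python
import json

# Dynamic index settings tuned for heavy indexing.
# "my-index" is a placeholder; the setting keys are standard.
tuning = {
    "index": {
        "number_of_replicas": 0,    # no replicas while bulk indexing
        "refresh_interval": "30s",  # refresh far less often than the 1s default
    }
}
# The real call would be: PUT /my-index/_settings with this body; replicas
# can be added back once the heavy indexing is done.
print(json.dumps(tuning))
```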

Is it possible to run two nodes in elasticsearch but only allow querying on one?

We have an Elasticsearch cluster set up with two nodes. We want the second node only for replication, as the load isn't enough to warrant a second node. All primary shards are on the master.
Now here's the problem: every other query gets forwarded to the secondary node. As a result, query times are doubled. I expect this is due to Elasticsearch's load balancing.
Is there a way to prevent queries from being delegated?
If you specify preference=_local on the search request url, the request will be executed on the node that received the request (assuming that this node has required shards allocated on it). See http://www.elasticsearch.org/guide/reference/api/search/preference/ for more information.
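A sketch of what such a request URL looks like (the host, index name, and query string are placeholders; preference=_local is the real parameter):

```python
from urllib.parse import urlencode

# Pin the search to the node that receives the request via preference=_local.
# Host, index name, and query are placeholders.
params = urlencode({"preference": "_local", "q": "field:value"})
url = "http://localhost:9200/my-index/_search?" + params
print(url)
```

If the receiving node holds copies of all the required shards, the search is served entirely locally instead of being spread across the cluster.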

Load Balancing Between Two elasticsearch servers

I have two ElasticSearch Servers:
http://12.13.54.333:9200
and
http://65.98.54.10:9200
On the first server I have 100k documents (id=1 to id=100k) and on the second server I have another 100k (id=100k+1 to id=200k).
I want to run a text search for the keyword obama on both servers in a single request. Is this possible?
Your question is a little generic...I'll try not to give an "it depends" kind of answer, but in order to do so I have to make a couple of assumptions.
Are those two servers actually two nodes on the same elasticsearch cluster? I suppose so.
Did you index data on an elasticsearch index composed of more than one shard? I suppose so. The default in elasticsearch is five shards, which in your case would lead to having two shards on one node and three on the other.
Then you can just send your query to one of those nodes via the REST API. The query will be executed on all the shards that the index you are querying (it can even be more than one index) is composed of. If you have replicas, the replica shards might be used at query time too. The node that received your query will then reduce the search results received from all the shards, returning the most relevant ones to the client.
To be more specific, the search phase on every shard will most likely collect only the document ids and their scores. Once the node that you hit has reduced the results, it can fetch all the needed fields (usually the _source field) only for the documents that it is supposed to return.
What's nice about elasticsearch is that even if you indexed data on different indexes you can query multiple indices and everything is going to work the same as I described. At the end of the day every index is composed of shards, and querying ten indices with one shard each is the same as querying one index with ten shards.
What I described applies to the default search_type that elasticsearch uses, called query_then_fetch. There are other search types that you can use when needed, such as count, which doesn't do any reduce or fetch but just returns the number of hits for a query, executing it on all shards and returning the sum of the hits from each shard.
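The multi-index point above is expressed in the API by joining index names in the URL path. A sketch with placeholder index names:

```python
import json

# Query two indices in one request by joining their names in the path.
# "index-a" and "index-b" are placeholders.
indices = ["index-a", "index-b"]
path = "/" + ",".join(indices) + "/_search"
query = {"query": {"match": {"text": "obama"}}}
# The real call would be: GET http://<any-node>:9200/index-a,index-b/_search
# with `query` as the request body.
print(path)
print(json.dumps(query))
```

The coordinating node fans the query out to every shard of every listed index and reduces the results exactly as in the single-index case.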
Revendra Kumar,
Elasticsearch should handle that for you. Elasticsearch was built from scratch to be distributed and to do distributed search.
Basically, if those servers are in the same cluster, you will have two shards (the first one holds the ids from 1 to 100k and the second one holds the ids from 100,001 to 200k). When you search for something, it doesn't matter which server the request hits: the search will run on both servers and the result will be returned to the client. The internal behavior of Elasticsearch is too extensive to explain here.