I have two Elasticsearch servers:
http://12.13.54.333:9200
and
http://65.98.54.10:9200
On the first server I have 100k documents (id=1 to id=100k) and on the second server I have another 100k documents (id=100k+1 to id=200k).
I want to run a text search for the keyword "obama" on both servers in one request. Is this possible?
Your question is a little generic...I'll try not to give an "it depends" kind of answer, but in order to do so I have to make a couple of assumptions.
Are those two servers actually two nodes on the same elasticsearch cluster? I suppose so.
Did you index data on an elasticsearch index composed of more than one shard? I suppose so. The default in elasticsearch is five shards, which in your case would lead to having two shards on one node and three on the other.
Then you can just send your query to one of those nodes via the REST API. The query will be executed on all the shards that compose the index (or indices, if you query more than one). If you have replicas, the replica shards might be used at query time too. The node that received your query will then reduce the search results received from all the shards, returning the most relevant ones.
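For instance, with the official Python client (a minimal 7.x-style sketch; the index and field names are assumptions, and any node of the cluster would do as the entry point):

```python
from elasticsearch import Elasticsearch

# Connect to a single node; it will coordinate the search across all shards.
es = Elasticsearch(["http://65.98.54.10:9200"])  # either server works

# A simple full-text query (index/field names are hypothetical). The node we
# hit fans the query out to every shard and reduces the per-shard results
# into a single ranked hit list.
response = es.search(
    index="my_index",
    body={"query": {"match": {"text": "obama"}}},
)

for hit in response["hits"]["hits"]:
    print(hit["_id"], hit["_score"])
```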
To be more specific, the search phase on every shard will most likely collect only the document ids and their scores. Once the node that you hit has reduced the results, it can fetch all the needed fields (usually the _source field), but only for the documents that it's supposed to return.
What's nice about elasticsearch is that even if you indexed data on different indices you can query multiple indices at once, and everything works the same way I described. At the end of the day every index is composed of shards, and querying ten indices with one shard each is the same as querying one index with ten shards.
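As a sketch (index names hypothetical, client as above), a multi-index search is just a comma-separated list of indices:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://65.98.54.10:9200"])

# The coordinating node queries every shard of both indices and merges
# the results, exactly as for a single index with more shards.
response = es.search(
    index="index_one,index_two",
    body={"query": {"match": {"text": "obama"}}},
)
```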
What I described applies to the default search_type that elasticsearch uses, called query_then_fetch. There are other search types that you can use when needed, for example count, which doesn't do any reduce or fetch but just returns the number of hits for a query, executing it on all shards and returning the sum of the hits from each shard.
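Note that the count search type comes from older Elasticsearch versions and was later removed; in recent releases the equivalent is to ask for zero hits, e.g. (index name hypothetical):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Equivalent of the old count search type: execute on all shards, skip the
# fetch phase entirely, and return only the summed hit count.
response = es.search(
    index="my_index",
    body={"query": {"match": {"text": "obama"}}, "size": 0},
)
print(response["hits"]["total"])  # on 7.x this is {"value": ..., "relation": ...}
```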
Revendra Kumar,
Elasticsearch should handle that for you. Elasticsearch was built from the ground up to be distributed and to perform distributed search.
Basically, if those servers are in the same cluster, you will have two shards (the first one holds the ids from 1 to 100k and the second one holds the ids from 100001 to 200k). When you search for something, it doesn't matter which server the request hits: the search will run on both servers and the merged result will be returned to the client. The internal behavior of Elasticsearch is too extensive to explain here.
I have been learning about Elasticsearch for some time now.
I want to see if the following statement is correct:
Elasticsearch achieves such high speeds because data in the same index can be split across several nodes, each of which takes a GET query and runs it at the same time.
Meaning if I have three pieces of data in the "book" index
{"name": "Pinocchio"}
{"name": "Frozen"}
{"name": "Diary of A Wimpy Kid"}
And if I decide to give the cluster three nodes, will each node hold one of the three books and therefore speed up my GET request 3x?
Yes, there's much more to it, but that's pretty much what happens behind the scenes.
Provided your index has three primary shards, each shard lands on a different node, and each shard contains one of the documents in your question: when you execute a query on your index, the query is broadcast to each of the index's shards and executed on each node in parallel, searching the documents on that node.
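As a sketch of that setup (names and settings are assumptions), an index created with three primary shards spreads its documents, and the work of searching them, over the three nodes:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Create the "book" index with three primary shards; with three data nodes,
# each shard (holding roughly one document here) lands on its own node.
es.indices.create(
    index="book",
    body={"settings": {"number_of_shards": 3, "number_of_replicas": 0}},
)

for name in ["Pinocchio", "Frozen", "Diary of A Wimpy Kid"]:
    es.index(index="book", body={"name": name})

# This query is broadcast to all three shards and runs on them in parallel.
response = es.search(index="book", body={"query": {"match": {"name": "frozen"}}})
```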
You have mentioned one of the advantages of Elasticsearch: it distributes data (shards and replicas) across multiple servers, and queries are executed in parallel. This is useful for high availability as well.
Another reason is how Elasticsearch internally stores data. It uses Lucene, which stores data in an inverted index.
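To illustrate the idea (a toy sketch, not Lucene's actual implementation), an inverted index maps each term to the documents containing it, so a term lookup never has to scan the documents themselves:

```python
# Toy inverted index: term -> list of document ids containing that term.
docs = {
    1: "Pinocchio",
    2: "Frozen",
    3: "Diary of A Wimpy Kid",
}

inverted = {}
for doc_id, text in docs.items():
    for term in text.lower().split():
        inverted.setdefault(term, []).append(doc_id)

# Looking up a term is a dictionary access, not a scan over all documents.
print(inverted.get("wimpy"))  # [3]
```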
You can check the links below for more explanation:
Why Elasticsearch is faster compared to raw SQL commands
How Elasticsearch Search So Fast?
How is Elasticsearch so fast?
We have different document structures/schemas that we onboard into different Elasticsearch indices. We have ~50 such indices, and one of our primary use cases is to search across all these document types, i.e. across all 50 indices. The data size within each index is ~10-20 GB, so each index easily fits into a single shard.
I am looking for ways to optimize the performance of searching across these 50 indices. There is a common field across all these indices which is available in a user's search request, and which could be used for sharding within each index if we had more than one shard per index. I'm not sure whether we could somehow use it to optimize this multi-index search, or whether there are better alternatives.
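(For reference, this is roughly what I mean by routing on that common field; the names are hypothetical, and indexing and searching must pass the same routing value.)

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Index with a routing value taken from the common field ("tenant" is a
# made-up example); the document is hashed onto one shard by this value.
es.index(
    index="docs_type_a",
    routing="tenant-42",
    body={"tenant": "tenant-42", "text": "quarterly report"},
)

# Searching with the same routing value hits only the shard(s) that the
# routing value maps to, instead of every shard of every index.
es.search(
    index="docs_type_a,docs_type_b",
    routing="tenant-42",
    body={"query": {"match": {"text": "quarterly report"}}},
)
```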
You need to provide more information in order to get a concrete answer. Please provide the following:
Your sample search query, with its average time taken across a good number of calls.
The heap size of your data nodes.
How many documents you are fetching in your search query, i.e. the size param.
The Elasticsearch search slow logs for your query.
The total number of data nodes, and the replica count of each index in the cluster where you are performing the search queries.
I am going through the documentation to better understand the role of the coordinating node and the different phases of a search request.
I came across this phrase:
Each shard returns just enough information to the coordinating node
What sort of information does this phrase, "just enough information", refer to?
If we had complex queries like bool queries or aggregations, I presume the coordinating node needs to execute the same query again to aggregate the results globally. In that case, would the coordinating node also have some kind of Lucene engine running to aggregate the results?
A coordinating node may hold data of your index (when it also acts as a data node and a shard of your index is present on it) or may not (when it is a dedicated coordinating node, or no shard of your index is present on it).
All it does is gather the results from all the data nodes participating in the query, build a priority queue, and return the top results.
To answer your question:
"I presume the coordinating node needs to execute the same query again to aggregate the results globally; in that case, would the coordinating node also have some kind of Lucene engine running to aggregate the results?"
No, the coordinating node will not execute the query again to produce the global results, and it needs no Lucene engine for the merge.
Think of it this way: you need the top 10 documents from your index, and assume you have 5 shards on 5 data nodes. Every shard calculates its own top 10 documents and sends them, with their scores, to the coordinating node. The coordinating node then builds a priority queue and returns the top 10 documents. For that it doesn't have to run another query; it just has to sort the 50 documents returned from the 5 data nodes, which already carry their scores, and return the top 10 docs.
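A toy sketch of that merge step (the shard results below are made up): the coordinating node only sorts the already-scored per-shard top lists, with no Lucene involved.

```python
import heapq

# Each shard returns its own top hits as (score, doc_id) pairs (made-up data).
shard_results = [
    [(9.1, "a1"), (8.7, "a2")],  # shard 0's top hits
    [(9.8, "b1"), (7.2, "b2")],  # shard 1's top hits
    [(8.9, "c1"), (8.8, "c2")],  # shard 2's top hits
]

# The coordinating node merges at most shards * size scored entries and
# keeps the global top 10: a sort, not a re-execution of the query.
all_hits = [hit for shard in shard_results for hit in shard]
top10 = heapq.nlargest(10, all_hits, key=lambda hit: hit[0])
print(top10)
```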
Good reads on this: https://discuss.elastic.co/t/how-does-elasticsearch-process-a-query/191181 and https://www.elastic.co/blog/elasticsearch-query-execution-order
I'm new to Elasticsearch.
Let's suppose I have 10,000 documents. The relevant field in the documents is such that, after indexing, most of them would end up in a single shard.
Would Elasticsearch rebalance this "skewed" distribution, maybe for better load balancing?
If I got your question right, the short answer is no, the documents will not be relocated. Shard selection is based on a modulo-like distribution, and it is used at index time as well as at retrieval time.
So, if (theoretically) ES were to rebalance such docs, you'd be unable to retrieve them with your routing key, as it would lead to the original shard (which would be empty in this theoretical case).
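A sketch of that modulo-like selection (Elasticsearch actually uses a murmur3 hash of the routing value, which defaults to the document id; zlib.crc32 below is just a stand-in):

```python
import zlib

def pick_shard(routing_value: str, num_primary_shards: int) -> int:
    # Stand-in for shard = hash(_routing) % number_of_primary_shards.
    return zlib.crc32(routing_value.encode()) % num_primary_shards

# The same formula runs at index time and at retrieval time, which is why
# already-indexed documents cannot be silently moved to another shard.
print(pick_shard("doc-42", 5))
```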
The "distribution" part of docs if nice place for further reading
I don't exactly understand what you mean by "the relevant field in the documents is such that, after indexing, most of them would end up in a single shard".
From what I understand, Elasticsearch automatically balances the shards across all the nodes in your setup to be as effective as possible.
A document is indexed on a single shard together with all its fields. The same document cannot have some fields on node 1 and other fields on node 2.
I'm in the middle of attempting to replace a Solr setup with Elasticsearch. This is a new setup, which has not yet seen production, so I have lots of room to fiddle with things and get them working well.
I have very, very large amounts of data. I'm indexing some live data and holding onto it for 7 days (by using the _ttl field). I do not store any data in the index (and disabled the _source field). I expect my index to stabilize around 20 billion rows. I will be putting this data into 2-3 named indexes. Search performance so far with up to a few billion rows is totally acceptable, but indexing performance is an issue.
I am a bit confused about how ES uses shards internally. I have created two ES nodes, each with a separate data directory, each with 8 indexes and 1 replica. When I look at the cluster status, I only see one shard and one replica for each node. Doesn't each node keep multiple indexes running internally? (Checking the on-disk storage location shows that there is definitely only one Lucene index present). -- Resolved, as my index setting was not picked up properly from the config. Creating the index using the API and specifying the number of shards and replicas has now produced exactly what I would've expected to see.
Also, I tried running multiple copies of the same ES node (from the same configuration), and it recognizes that there is already a copy running and creates its own working area. These new instances of nodes also seem to only have one index on-disk. -- Now that each node is actually using multiple indices, a single node with many indices is more than sufficient to throttle the entire system, so this is a non-issue.
When do you start additional Elasticsearch nodes, for maximum indexing performance? Should I have many nodes each running with 1 index 1 replica, or fewer nodes with tons of indexes? Is there something I'm missing with my configuration in order to have single nodes doing more work?
Also: Is there any metric for knowing when an HTTP-only node is overloaded? Right now I have one node devoted to HTTP only, but aside from CPU usage, I can't tell if it's doing OK or not. When is it time to start additional HTTP nodes and split up your indexing software to point to the various nodes?
Let's clarify the terminology a little first:
Node: an Elasticsearch instance running (a java process). Usually every node runs on its own machine.
Cluster: one or more nodes with the same cluster name.
Index: more or less like a database.
Type: more or less like a database table.
Shard: effectively a lucene index. Every index is composed of one or more shards. A shard can be a primary shard (or simply shard) or a replica.
When you create an index you can specify the number of shards and number of replicas per shard. The default is 5 primary shards and 1 replica per shard. The shards are automatically evenly distributed over the cluster. A replica shard will never be allocated on the same machine where the related primary shard is.
What you see in the cluster status is weird; I'd suggest checking your index settings using the get settings API. It looks like you configured only one shard, but in any case you should see more shards if you have more than one index. If you need more help you can post the output that you get from elasticsearch.
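With the Python client that check is a one-liner (index name hypothetical; the same data is available over plain HTTP at /my_index/_settings):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Shows the number_of_shards / number_of_replicas actually applied to the
# index, which may differ from what the config file was supposed to set.
print(es.indices.get_settings(index="my_index"))
```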
How many shards and replicas you use really depends on your data, the way you access them, and the number of available nodes/servers. It's best practice to overallocate shards a little in order to redistribute them in case you add more nodes to your cluster, since you can't (for now) change the number of shards once you've created the index. Otherwise, you can always change the number of shards if you are willing to do a complete reindex of your data.
Every additional shard comes with a cost, since each shard is effectively a Lucene instance. The maximum number of shards that you can have per machine really depends on the hardware available and on your data. It's good to know that having 100 indices with one shard each, or one index with 100 shards, is really the same, since you'd have 100 Lucene instances in both cases.
Of course, at query time, if you want to query a single elasticsearch index composed of 100 shards, elasticsearch needs to query them all in order to get proper results (unless you used specific routing for your documents, in which case only a specific shard is queried). This has a performance cost.
You can easily check the state of your cluster and nodes using the Cluster Nodes Info API, through which you can check a lot of useful information: everything you need in order to know whether your nodes are running smoothly or not. Even easier, there are a couple of plugins for checking that information through a nice user interface (which internally uses the elasticsearch APIs anyway): paramedic and bigdesk.
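With the Python client this boils down to two calls (a sketch; the same data is exposed over plain HTTP at /_nodes and /_nodes/stats):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Static node information: roles, settings, JVM and network details.
info = es.nodes.info()

# Live metrics: heap usage, CPU, thread pools, and so on.
stats = es.nodes.stats()

for node_id, node in stats["nodes"].items():
    print(node["name"], node["jvm"]["mem"]["heap_used_percent"])
```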