How does Elasticsearch query by text so fast?

I have been learning about Elasticsearch for some time now.
I want to see if the following statement is correct:
Elasticsearch manages such high speeds because you can split data that is in the same index across several nodes, each of which will take the same GET query and run it at the same time.
Meaning if I have three pieces of data in the "book" index
{"name": "Pinocchio"}
{"name": "Frozen"}
{"name": "Diary of A Wimpy Kid"}
And if I decide to give the cluster three nodes, will each node hold one of the three books and therefore speed up my GET request 3x?

Yes, there's much more to it, but that's pretty much what happens behind the scenes.
Provided your index has three primary shards, each shard lands on a different node, and each shard contains one of the documents in your question, then when you execute a query on your index, the query is broadcast to each of its shards and runs on each node in parallel to search the documents held on that node.
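To make that concrete, here is a minimal sketch using the official Python client (keyword arguments follow the 8.x client, e.g. settings= and document=; older clients take a body= dict instead) against a hypothetical local cluster. It creates the "book" index with three primary shards, indexes the three documents, and runs a match query that gets fanned out to all three shards in parallel:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # hypothetical local cluster

# Three primary shards, so the documents can spread across three nodes
es.indices.create(index="book", settings={"number_of_shards": 3, "number_of_replicas": 0})

for name in ["Pinocchio", "Frozen", "Diary of A Wimpy Kid"]:
    es.index(index="book", document={"name": name})

es.indices.refresh(index="book")  # make the new documents searchable

# The query is broadcast to every shard of the index and executed in parallel
result = es.search(index="book", query={"match": {"name": "pinocchio"}})
print(result["hits"]["hits"])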

You have mentioned one of the advantages of Elasticsearch: it distributes data (shards and replicas) across multiple servers and executes queries in parallel. This is useful for high availability as well.
Another reason is how Elasticsearch stores data internally. It uses Lucene, which stores data in an inverted index (a toy sketch of the idea follows the links below).
You can check the links below for more explanation:
Why Elasticsearch is faster compared to raw SQL commands
How does Elasticsearch search so fast?
How is Elasticsearch so fast?
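To illustrate the idea of an inverted index (this is only a toy sketch of the concept, not how Lucene actually implements it): instead of scanning every document for a term, the index maps each term to the documents that contain it, so a text query becomes a dictionary lookup.

from collections import defaultdict

docs = {
    1: "Pinocchio",
    2: "Frozen",
    3: "Diary of A Wimpy Kid",
}

# Build a toy inverted index: term -> set of document ids containing it
inverted = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        inverted[term].add(doc_id)

# Searching is a lookup instead of a scan over all documents
print(inverted["wimpy"])   # {3}
print(inverted["frozen"])  # {2}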

Related

How many indexes can I create in Elasticsearch?

I am very new to Elasticsearch and its applications. I found that Elasticsearch saves data (indexes) onto disk. Then I wondered: are there any limitations on the number of indexes that can be created, or can I create as many as I want since I have a very large disk?
Currently I have Elasticsearch deployed as a single-node cluster with Docker. I have read something about shards and their limitations etc., but I was not able to understand it properly.
Is there anyone on SO who can shed some light on these questions for a newbie, in layman's terms?
What is a single-node cluster, and how does my data get saved onto disk? Also, what are shards and how are they related to Elasticsearch?
I guess the best answer is "it depends". Generally there is no limit on the number of indexes you can have. Every index has its own mapping and is independent of other indexes by default. Note that an index is not just raw data; you can think of each one as an entire database on its own. There are many variables behind this question; for example, if you are planning to have replicas of the shards in an index, you may run into limits because of the volume of documents you plan to ingest into that index.
As another note, you may first need to ask why you want many indexes. Is it to improve search performance or query throughput? If so, it is probably better to use replica shards beside the primary shards of a single index, because queries are executed in parallel across replica shards, and you can think of each shard as a standalone index inside your main index. In conclusion, there is no hard limit as long as you have enough free space to store new data (the inverted index built for each field keeps expanding), but depending on your needs it may be better to have primary and replica shards inside one index.
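As a minimal sketch of that suggestion (the cluster address and the "books" index name are placeholders; keyword arguments follow the 8.x Python client), you can raise the replica count of an existing index so that searches can be served by more shard copies in parallel:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder cluster address

# Give every primary shard of the hypothetical "books" index one replica;
# replicas serve search requests in parallel with the primaries.
es.indices.put_settings(index="books", settings={"index": {"number_of_replicas": 1}})

# Inspect where the primary and replica shards ended up
print(es.cat.shards(index="books", format="json"))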

ElasticSearch Performance Optimization in Multi-Index Search over a Small Data Set per Index

We have different document structures/schemas that we on-board into different Elasticsearch indices. We have ~50 such indices, and one of our primary use cases is to search across all these document types, i.e. across all 50 indices. The data size within each index is ~10-20 GB, so each index easily fits into a single shard.
I am looking for ways to optimize the performance of searches across these 50 indices. There is a particular field common to all these indices which is available in a user's search request, and it could be used for routing documents to shards within each index if we had more than one shard per index. I am not sure whether we could make use of it somehow to optimize this multi-index search, or whether there are other options.
You need to provide more information in order to get a concrete answer. Please provide the following:
Your sample search query with its average time taken across a good number of calls.
The heap size of your data nodes.
How many documents you are fetching in your search query, i.e. the size param.
The Elasticsearch search slow logs for your query.
The total number of data nodes, and the number of replicas of each index, in the cluster where you are performing the search queries.
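For what it's worth, a multi-index search of the kind described in the question is a single request that names several indices or a wildcard pattern. The sketch below uses the Python client with made-up index names and a made-up common field; the commented variant shows how routing could narrow the search to one shard per index if the indices had more than one shard and were indexed with that field as the routing value:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder cluster address

# One request can target many indices via a comma-separated list or a wildcard.
# "doc-*" and "tenant_id" are hypothetical names used only for illustration.
result = es.search(
    index="doc-*",
    query={"bool": {
        "filter": [{"term": {"tenant_id": "t-42"}}],
        "must": [{"match": {"title": "quarterly report"}}],
    }},
    size=20,
)
print(result["hits"]["total"])

# If each index had more than one shard and documents were indexed with the
# common field as the routing value, the same search could hit one shard per index:
# es.search(index="doc-*", routing="t-42", query=..., size=20)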

Scaling horizontally meaning

I am learning Elasticsearch, and this line is written in their documentation:
Performing full SQL-style joins in a distributed system like
Elasticsearch is prohibitively expensive. Instead, Elasticsearch
offers two forms of join which are designed to scale horizontally.
Could someone please explain to me in layman's terms what the second sentence means?
As a preamble you might want to go through another thread on SO that explains horizontal vs vertical scaling.
Most of the time, an ES cluster is designed to grow horizontally, meaning that whenever your cluster starts to show some signs of weaknesses (slow queries, slow indexing, etc), all you need to do is add one or more nodes to your cluster and ES will spread the load on more hardware, and thus, lighten the burden on existing nodes. That's what horizontal scaling is all about and ES is perfectly designed for this given the way it partitions the indexes into shards that get assigned to the nodes in your cluster.
As you know, ES has no JOIN feature and they did it on purpose for the reason mentioned above (i.e. "prohibitively expensive"). There are four ways to model relationships in ES:
by denormalizing your data (preferred)
by using nested types
by using parent/child documents
by using application-side joins
The link you referred to, which introduces the nested, has_parent and has_child queries, is about the second and third bullet points above. Nested and parent/child documents have been designed to take as much advantage as possible of the index/shard partitioning model that ES supports.
When using a nested field (1-N relationship), each element inside of the nested array is just another hidden document under the hood and is stored in a shard somewhere in your cluster. When using a join field (1-N relationship), parent and child documents are also documents stored in your index within a shard located somewhere in your cluster. When your index grows (i.e. when you have more and more parent and child and/or nested data), you add nodes and the shards containing your documents will get spread within the cluster transparently. This means that wherever your documents are stored, you can retrieve them as well as their related documents without having to perform expensive joins.
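As a minimal sketch of what those two options look like in a mapping (index and field names are made up for illustration; keyword arguments follow the 8.x Python client, older clients pass a body= dict):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder cluster address

# Nested field: each comment becomes a hidden sub-document stored
# in the same shard as its parent post.
es.indices.create(index="posts", mappings={
    "properties": {
        "title": {"type": "text"},
        "comments": {
            "type": "nested",
            "properties": {
                "author": {"type": "keyword"},
                "body": {"type": "text"},
            },
        },
    },
})

# Join field: parent and child are separate documents that must live in the
# same shard (children are indexed with routing set to the parent id).
es.indices.create(index="qa", mappings={
    "properties": {
        "relation": {"type": "join", "relations": {"question": "answer"}},
        "text": {"type": "text"},
    },
})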
You will get more information about horizontal scaling here.
In Elasticsearch terms, when you start two or more ES instances on the same network with the same cluster configuration, they connect to each other and form a distributed cluster. So if you add one more computer (node), start an ES instance there, and keep the cluster configuration the same, that node will automatically join the existing cluster, and the data and the request load will be shared. When you make any request to ES, whether it is a read or a write, each request can be processed in parallel, and the speed you get depends on the number of nodes and on the shards of each index that they hold.
Get more information here
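As a small sketch of checking this behaviour (the address is a placeholder; any node of the cluster can be queried), the cluster APIs show the node count and shard distribution growing as nodes join:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder; any node of the cluster works

health = es.cluster.health()
print(health["number_of_nodes"], health["active_shards"])  # grows as nodes join

# One entry per node that has joined the cluster
print(es.cat.nodes(format="json"))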

elasticsearch auto rebalancing of data across shards

I'm new to ElasticSearch.
Let's suppose I have 10,000 documents. The relevant field in the documents is such that, after getting indexed, most of them would end up in a single shard.
Would Elasticsearch rebalance this "skewed" distribution for, maybe, better load balancing?
If I got your question right, the short answer is no, the documents will not be relocated. Choosing a shard is based on a modulo-like distribution of the routing value, and it is used for indexing as well as for retrieval.
So, if (theoretically) ES were to rebalance such docs, you would be unable to retrieve them with your routing key, as it would lead to the original shard (which would be empty in that theoretical case).
The "distribution" part of the docs is a nice place for further reading.
I don't exactly understand what you mean by "the relevant field in the documents is such that after getting indexed most of them would end up in a single shard".
From what I understand, Elasticsearch automatically balances the shards between all the nodes in your setup to be as effective as possible.
A document is indexed on a single shard together with its fields. The same document cannot have some fields on node 1 and some other fields on node 2.

Load Balancing Between Two elasticsearch servers

I have two ElasticSearch Servers:
http://12.13.54.333:9200
and
http://65.98.54.10:9200
On the first server I have 100k documents (id=1 to id=100k) and on the second server I have 100k documents (id=100k+1 to id=200k).
I want to run a text search for the keyword obama on both servers in one request. Is this possible?
Your question is a little generic...I'll try not to give an "it depends" kind of answer, but in order to do so I have to make a couple of assumptions.
Are those two servers actually two nodes on the same elasticsearch cluster? I suppose so.
Did you index data into an Elasticsearch index composed of more than one shard? I suppose so. The default in Elasticsearch (before 7.0) is five shards, which in your case would lead to having two shards on one node and three on the other.
Then you can just send your query to one of those nodes via the REST API. The query will be executed on all the shards that the index you are querying (there can even be more than one index) is composed of. If you have replicas, the replica shards might be used too at query time. The node that received your query will then reduce the search results from all the shards, returning the most relevant ones.
To be more specific the search phase on every shard will most likely only collect the document ids and their score. Once the node that you hit has reduced the results, it can fetch all the needed fields (usually the _source field) only for the documents that it's supposed to return.
What's nice about elasticsearch is that even if you indexed data on different indexes you can query multiple indices and everything is going to work the same as I described. At the end of the day every index is composed of shards, and querying ten indices with one shard each is the same as querying one index with ten shards.
What I described applies to the default search_type that Elasticsearch uses, called query_then_fetch. There are other search types that you can use when needed, like for example count, which doesn't do any reduce or fetch but just executes the query on all shards and returns the sum of the hits from each shard.
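As a minimal sketch of both points (addresses are the ones from the question, the index name is a placeholder, and keyword arguments follow the 8.x Python client): the request can be sent to either node, it can target one index or several at once, and a pure hit count can be obtained without fetching documents:

from elasticsearch import Elasticsearch

# Either node of the cluster can receive the request.
es = Elasticsearch(["http://12.13.54.333:9200", "http://65.98.54.10:9200"])

# Full-text search across every shard of the index (or of several indices at once).
results = es.search(index="articles", query={"match": {"body": "obama"}})
print(results["hits"]["total"])

# Just the number of hits, without fetching documents -- the modern
# equivalent of the old search_type=count.
count = es.count(index="articles", query={"match": {"body": "obama"}})
print(count["count"])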
Revendra Kumar,
Elasticsearch should handle that for you. Elasticsearch was built from scratch to be distributed and to do distributed search.
Basically, if those servers are in the same cluster, you will have two shards (the first one holds the ids from 1 to 100k and the second one holds the ids from 100001 to 200k). When you search for something, it doesn't matter which server the request hits; the search is executed on both servers and the result is returned to the client. The internal behavior of Elasticsearch is too extensive to explain here.
