I am learning Elasticsearch, and its documentation contains this passage:
Performing full SQL-style joins in a distributed system like
Elasticsearch is prohibitively expensive. Instead, Elasticsearch
offers two forms of join which are designed to scale horizontally.
Could someone please explain in layman's terms what the second sentence means?
As a preamble you might want to go through another thread on SO that explains horizontal vs vertical scaling.
Most of the time, an ES cluster is designed to grow horizontally, meaning that whenever your cluster starts to show signs of weakness (slow queries, slow indexing, etc.), all you need to do is add one or more nodes to your cluster. ES will then spread the load over more hardware and thus lighten the burden on the existing nodes. That's what horizontal scaling is all about, and ES is well suited to it given the way it partitions indexes into shards that get assigned to the nodes in your cluster.
As you know, ES has no JOIN feature and they did it on purpose for the reason mentioned above (i.e. "prohibitively expensive"). There are four ways to model relationships in ES:
by denormalizing your data (preferred)
by using nested types
by using parent/child documents
by using application-side joins
The link you referred to, which introduces the nested, has_parent and has_child queries, is about the second and third bullet points above. Nested and parent/child documents have been designed to take as much advantage as possible of the index/shard partitioning model that ES supports.
When using a nested field (1-N relationship), each element inside the nested array is just another hidden document under the hood, stored in the same shard as the document that contains it. When using a join field (1-N relationship), parent and child documents are also separate documents in your index, and a child is always routed to the shard that holds its parent. When your index grows (i.e. when you have more and more parent, child and/or nested data), you add nodes and the shards containing your documents get spread across the cluster transparently. Because related documents always live together on one shard, you can retrieve a document along with its related documents wherever they are stored, without having to perform expensive cross-node joins.
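To make the co-location idea concrete, here is a rough Python sketch. It is a toy model, not Elasticsearch code: the helper names (`shard_for`, `index_document`), the document IDs and the use of CRC32 as the hash are all assumptions made for illustration (Elasticsearch uses its own hash of the routing value). It only shows that children indexed with their parent's ID as the routing value end up on the parent's shard.

```python
import zlib


def shard_for(routing_value: str, num_shards: int) -> int:
    """Toy routing: hash the routing value, then take it modulo the shard count.

    CRC32 is a stand-in chosen for this sketch, not the hash ES really uses.
    """
    return zlib.crc32(routing_value.encode("utf-8")) % num_shards


def index_document(shards, doc_id, body, routing=None):
    """Store a document in the shard picked by its routing value.

    Without an explicit routing value, the document's own ID is used.
    Returns the number of the shard the document landed on.
    """
    routing_value = routing if routing is not None else doc_id
    shard_number = shard_for(routing_value, len(shards))
    shards[shard_number][doc_id] = body
    return shard_number


# Five empty shards (hypothetical index).
shards = [dict() for _ in range(5)]

# The parent is routed by its own ID.
parent_shard = index_document(shards, "question-1", {"type": "question"})

# Each child is routed by the PARENT's ID, so it follows the parent.
child_shards = [
    index_document(
        shards,
        child_id,
        {"type": "answer", "parent": "question-1"},
        routing="question-1",
    )
    for child_id in ("answer-1", "answer-2", "answer-3")
]

print("parent shard:", parent_shard)
print("child shards:", child_shards)
```

However many shards or nodes you add, a has_child or has_parent lookup in this model only ever needs to look inside one shard.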
You can find more information about horizontal scaling here.
In Elasticsearch terms, when you start two or more ES instances on the same network with the same cluster configuration, they connect to each other and form a distributed cluster. So if you add one more machine (node), start an ES instance on it, and keep the cluster configuration the same, that node is automatically attached to the existing cluster, and the data and the request load are shared with it. Whether a request to ES is a read or a write, requests can be processed in parallel, and the speed you get depends on the number of nodes and on how the shards of each index are spread across them.
Get more information here
I have been learning about Elasticsearch for some time now.
I want to see if the following statement is correct:
Elasticsearch manages such high speeds because you can split data that is in the same index between several nodes that will take a GET query and run it at the same time.
Meaning if I have three pieces of data in the "book" index
{"name": "Pinocchio"}
{"name": "Frozen"}
{"name": "Diary of A Wimpy Kid"}
And I decide to give the cluster three nodes, each node will hold one of the three books and therefore speed up my get request 3x?
Yes, there's much more to it, but that's pretty much what happens behind the scenes.
Provided your index has three primary shards and each shard lands on a different node and contains one of the documents in your question, when you execute a query on your index, the query gets broadcast to each of the shards of your index and is executed on each node in parallel to search the documents on that node.
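As a rough illustration of that broadcast-and-merge step, here is a minimal Python sketch. It is not how Elasticsearch is implemented: the three in-memory lists stand in for shards on three nodes, a thread pool stands in for the nodes working in parallel, and the function names are made up for this example.

```python
from concurrent.futures import ThreadPoolExecutor

# Three primary shards, each holding one document (as in the question).
shards = [
    [{"name": "Pinocchio"}],
    [{"name": "Frozen"}],
    [{"name": "Diary of A Wimpy Kid"}],
]


def search_shard(shard_docs, term):
    """Return the documents of ONE shard whose name contains the term."""
    term = term.lower()
    return [doc for doc in shard_docs if term in doc["name"].lower()]


def search_index(all_shards, term):
    """Send the query to every shard in parallel, then merge the hits."""
    with ThreadPoolExecutor(max_workers=len(all_shards)) as pool:
        per_shard_hits = list(
            pool.map(search_shard, all_shards, [term] * len(all_shards))
        )
    # Flatten the per-shard result lists into one result list.
    return [hit for hits in per_shard_hits for hit in hits]


print(search_index(shards, "frozen"))
```

Each shard only scans its own documents, which is where the speed-up comes from. The merge step and the network hops are part of the "much more to it", so in practice the gain will be less than a clean 3x.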
You have mentioned one of the advantages of Elasticsearch: it distributes data (primary shards and replicas) across multiple servers, and a query is executed on them in parallel. This is useful for high availability as well.
Another reason is how Elasticsearch stores data internally. It uses Lucene, which stores data in an inverted index.
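If the term "inverted index" is new to you, this small Python sketch shows the idea using the three books from the question. It is a deliberately simplified model (lowercasing and splitting on whitespace only); Lucene's real data structures are far more elaborate, and the function names here are made up.

```python
from collections import defaultdict

# Document ID -> text, using the books from the question.
docs = {
    1: "Pinocchio",
    2: "Frozen",
    3: "Diary of A Wimpy Kid",
}


def build_inverted_index(documents):
    """Map each lowercased word to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def lookup(index, term):
    """Find matching document IDs with a single dictionary lookup."""
    return sorted(index.get(term.lower(), set()))


index = build_inverted_index(docs)
print(lookup(index, "wimpy"))   # [3]
print(lookup(index, "frozen"))  # [2]
```

The point is that a search does not scan every document. It looks up the term and immediately gets the list of documents that contain it.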
You can check the links below for more explanation:
Why Elasticsearch is faster compared to raw SQL command
How Elasticsearch Search So Fast?
How is Elasticsearch so fast?
I am very new to Elasticsearch and its applications. I found that Elasticsearch saves its data (indexes) to disk. Then I wondered: is there any limit on the number of indexes that can be created, or can I create as many as I like, since I have a lot of disk space?
Currently I have Elasticsearch deployed as a single-node cluster with Docker. I have read something about shards and their limitations, but I was not able to understand it properly.
Is there anyone on SO, who can shed some light onto these questions for a newbie in layman terms?
What is a single-node cluster, and how does my data get saved to disk? Also, what are shards and how are they related to Elasticsearch?
I guess the best answer is "it depends". Generally, there is no hard limit on the number of indexes. Every index has its own mapping and is independent of the other indexes by default. Note that an index is not just a blob of data; you can think of each one as roughly a whole database on its own. Many variables affect the answer. For example, if you plan to replicate the shards of an index, you may hit limits because of the size of the documents you plan to ingest into it, since every replica stores another copy.
As another note, you may first want to ask why you need many indexes. Is it to improve search performance or query throughput? If so, it is perhaps better to use replica shards alongside the primary shards in a single index, because queries can be executed in parallel on the replica shards. You can think of a shard as a standalone index inside your main index. In conclusion, there is no limit as long as you have enough free disk space to store new data (the inverted index built for each field keeps growing), but depending on your needs it may be better to have primary and replica shards inside one index.
In our deployment, there are one thousand shards. Insertions go through a distributed table with the sharding key jumpConsistentHash(colX, 1000). When I query for rows with colX=... and turn on send_logs_level='trace', I see that the query is sent to all shards and executed on each of them. This is limiting our QPS (queries per second). The ClickHouse documentation states:
SELECT queries are sent to all the shards and work regardless of how data is distributed across the shards (they can be distributed completely randomly).
When you add a new shard, you don’t have to transfer the old data to it.
You can write new data with a heavier weight – the data will be distributed slightly unevenly, but queries will work correctly and efficiently.
You should be concerned about the sharding scheme in the following cases:
* Queries are used that require joining data (IN or JOIN) by a specific key. If data is sharded by this key, you can use local IN or JOIN instead of GLOBAL IN or GLOBAL JOIN, which is much more efficient.
* A large number of servers is used (hundreds or more) with a large number of small queries (queries of individual clients - websites, advertisers, or partners).
In order for the small queries to not affect the entire cluster, it makes sense to locate data for a single client on a single shard.
Alternatively, as we’ve done in Yandex.Metrica, you can set up bi-level sharding: divide the entire cluster into “layers”, where a layer may consist of multiple shards.
Data for a single client is located on a single layer, but shards can be added to a layer as necessary, and data is randomly distributed within them.
Distributed tables are created for each layer, and a single shared distributed table is created for global queries.
It seems there is a solution for small queries like ours (the second bullet above), but the point is not clear to me. Does it mean that when I run a query with the predicate colX=..., I need to find the "layer" that contains its rows and then query the distributed table for that layer?
Is there a way to query on the global distributed table for these small queries?
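For context on why a single colX value should only ever need one shard, here is a minimal Python sketch of the published jump consistent hash algorithm (Lamping and Veach), which a sharding key like jumpConsistentHash(colX, 1000) is named after. Treat it as an illustration under assumptions: it has not been checked bit for bit against ClickHouse's own function, the key values are hypothetical, and how a bucket number maps to a physical shard also depends on the shard order and weights in the cluster configuration.

```python
MASK_64 = 0xFFFFFFFFFFFFFFFF  # keep arithmetic within unsigned 64 bits


def jump_consistent_hash(key: int, num_buckets: int) -> int:
    """Map a 64-bit integer key to a bucket in [0, num_buckets)."""
    if num_buckets < 1:
        raise ValueError("num_buckets must be at least 1")
    key &= MASK_64
    bucket, next_jump = -1, 0
    while next_jump < num_buckets:
        bucket = next_jump
        # Linear congruential step that drives the pseudo-random jumps.
        key = (key * 2862933555777941757 + 1) & MASK_64
        next_jump = int((bucket + 1) * (float(1 << 31) / float((key >> 33) + 1)))
    return bucket


# Hypothetical colX values: each one maps to exactly one of 1000 buckets.
for col_x in (1, 42, 123456789):
    print(col_x, "->", jump_consistent_hash(col_x, 1000))
```

Because the mapping is deterministic, all rows with the same colX were written to one bucket, so a query with an equality predicate on colX in principle only needs the shard behind that bucket.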
We use Elasticsearch for full-text search use cases. The data is metadata collected across different objects and stored as ES documents. We also update the document in ES whenever the master data gets updated. So, basically, it is not a logging use case.
We create one ES index (one primary shard and one replica shard) as soon as a tenant is onboarded to our application. This is to ensure that the ES index is ready when the first object gets created.
We do not anticipate a large volume of data in the index. The data could amount to a few hundred MB per index, so each index is relatively empty.
Also, full-text search is an optional add-on feature in the application, so not all tenants may opt for it. However, our technical team suggested creating the index upfront.
What is the overhead of such indices on the ES performance? Are we doing anything different from best practices of ES?
Any input is appreciated.
Empty Elasticsearch indices don't have much overhead, since they hold no actual data. The only place an empty index takes up space is the cluster state (index mappings, settings, etc.), which every node in the cluster holds. Any change to an index's mappings or settings, i.e. its metadata, updates the cluster state, and that update is propagated to all nodes in the ES cluster.
If you have sufficient memory and ES heap size, you don't have to worry at all about these empty indices which IMO makes sense considering your use-case.
I have an ES cluster and an application that indexes data into ES.
EDIT: The application creates indices in a dynamic way based on some business rules.
For example, if the application listens to tweets from the Twitter API based on some hashtags, it creates an index in ES for each hashtag.
This way, each time a new hashtag comes, a new index is created in ES.
Sometimes shard reallocation happens, and at that stage the cluster behaves poorly because the amount of data moved between nodes is huge.
Through the ES cluster API, we can disable shard reallocation and balancing.
What will be the effects (positive and negative) of disabling the reallocation and balancing?
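For reference, this is roughly what the request body for the cluster settings API (PUT /_cluster/settings) looks like when turning both knobs off. The sketch only builds and prints the JSON; it does not contact a cluster. The helper name `allocation_settings` is made up, while `cluster.routing.allocation.enable` and `cluster.routing.rebalance.enable` are the setting names as I understand them from the Elasticsearch documentation, so check them against your version.

```python
import json


def allocation_settings(allocation="all", rebalance="all", scope="persistent"):
    """Build a body for PUT /_cluster/settings.

    allocation: "all", "primaries", "new_primaries" or "none"
    rebalance:  "all", "primaries", "replicas" or "none"
    scope:      "persistent" or "transient"
    """
    if scope not in ("persistent", "transient"):
        raise ValueError("scope must be 'persistent' or 'transient'")
    return {
        scope: {
            "cluster.routing.allocation.enable": allocation,
            "cluster.routing.rebalance.enable": rebalance,
        }
    }


# Disable both shard allocation and rebalancing.
body = allocation_settings(allocation="none", rebalance="none")
print(json.dumps(body, indent=2))
```

One thing to keep in mind with a setup that creates indices on the fly: as far as I know, setting allocation to "none" stops all shard allocation, including the shards of newly created indices, so disabling only rebalancing may be closer to what is wanted here.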
This sounds like quite an unorthodox way of organizing documents in Elasticsearch. Wouldn't it be simpler to have a not_analyzed string field holding an array of hashtags (as a single tweet can have zero, one, two or more hashtags)?
If there were only one hashtag per tweet, you could use it to route tweets to a specific shard, if search performance is a concern for you.
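A rough sketch of that single-index layout, written as plain Python dictionaries that mirror the JSON you would send to ES. The index layout, field names and sample tweet are hypothetical. The `keyword` type is used here on the assumption of a recent ES version, where it plays the role of the older not_analyzed string; on old versions the field would be declared as a string with "index": "not_analyzed" instead.

```python
import json

# Mapping for one shared index instead of one index per hashtag.
tweet_mapping = {
    "mappings": {
        "properties": {
            "text": {"type": "text"},         # analyzed, for full-text search
            "hashtags": {"type": "keyword"},  # exact values, not analyzed
        }
    }
}

# A field can hold several values; no special array type is needed.
sample_tweet = {
    "text": "Learning how shards work",
    "hashtags": ["elasticsearch", "search"],
}


def hashtag_query(tag: str) -> dict:
    """Build a query matching tweets that carry exactly this hashtag."""
    return {"query": {"term": {"hashtags": tag}}}


print(json.dumps(tweet_mapping, indent=2))
print(json.dumps(hashtag_query("elasticsearch"), indent=2))
```

With this layout a new hashtag is just a new value in an existing field, so no index is created and no shards need to be allocated when it first appears.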
Anyway, if you disable shard balancing, some machines would end up with an increasingly disproportionate number of documents and others with too few, which could hamper indexing and search performance.
Also, if you don't have any replicas of your shards, then in the event of a node shutdown part of your data would become inaccessible. I'm sure there are other downsides in the long run as well.