I have an Elasticsearch index with hundreds of millions of documents that is used mostly for "get by id" queries.
I'm considering adding a routing value to documents at index time. The routing value will be a random number from 0 to 9.
Later this routing value will appear in the URL together with the document id and will be used to get the document. Currently the URLs contain only document ids, but I plan to add routing values as well. The new URL will look like this: https://tarta.ai/j/[route]/[doc id].
I'm wondering whether this will decrease the time needed to find a document in the index. My assumption is that Elasticsearch then won't look for the document in all the shards, but only in the shard that this particular routing value resolves to.
Some Elasticsearch specs:
index size is 110 GB.
the number of docs is 36M, but we're adding hundreds of thousands every day.
5 shards.
16 GB RAM and a 2-core VM with a 1 TB SSD.
Routing is especially useful when searching (i.e. POST index/_search): instead of searching all the shards of an index, ES will only search the single shard that the routing value resolves to.
If you specify a routing value when indexing a document, you MUST specify the same routing value when GETting that document; there's no alternative.
# index with routing
PUT index/_doc/1?routing=123
# returns the document
GET index/_doc/1?routing=123
# returns nothing
GET index/_doc/1
If you don't specify any routing value when indexing a document, ES will use the ID as the routing value and store the document on the shard that routing value resolves to. So when GETting such a document without any routing value, ES knows to use the ID as the routing value, which is why you don't have to specify it (although you could).
# index without routing
PUT index/_doc/1
# returns the document
GET index/_doc/1
# returns the document as well
GET index/_doc/1?routing=1
What that means is that for GET operations, routing brings no performance benefit: a GET always goes to exactly one shard, routing or not. It's mainly for search that routing has an added value.
When using routing, you also need to make sure your routing values are "well-balanced", so that their hashes resolve to every shard with similar probability. If that's not the case, you run the risk of creating hot-spots in your index, i.e. some shards growing much bigger than others.
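The hot-spot risk is easy to see in a quick sketch. The Python snippet below uses md5 as a stand-in for the murmur3 hash that Elasticsearch actually applies in its routing formula (shard_num = hash(_routing) % num_primary_shards); the exact shard counts are illustrative only:

```python
import hashlib
from collections import Counter

def shard_for(routing: str, num_shards: int) -> int:
    # md5 is a stand-in for the murmur3 hash ES really uses;
    # the formula is the same: shard_num = hash(_routing) % num_primary_shards
    h = int.from_bytes(hashlib.md5(routing.encode()).digest()[:4], "big")
    return h % num_shards

# Well-balanced routing values spread documents evenly over all 5 shards...
balanced = Counter(shard_for(str(i), 5) for i in range(10_000))

# ...while a single skewed routing value piles everything onto one shard (a hot-spot).
skewed = Counter(shard_for("tenant-A", 5) for _ in range(10_000))

print(balanced)  # roughly 2000 documents per shard
print(skewed)    # all 10000 documents on one shard
```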
Related
I am trying to figure out how to create a monthly rolling index with custom routing (multi-tenancy scenario), with these requirements:
WRITE flow: Each document will have a timestamp, and the document should be indexed into the appropriate backing index based on that timestamp, not into the latest index. Write requests will also carry a custom routing key (e.g. customerId) so they hit a specific shard.
READ flow: Requests must be routed to all backing indexes. Requests will have a custom routing key specified (e.g. customerId), and results must be aggregated and returned.
Index creation: Rolling the index should be automated. Each index should have a custom routing key (e.g. customerId).
Wondering what options are available?
This very feature, called time-series data streams (TSDS), will be coming in the upcoming ES 8.5 release.
The big difference between normal data streams and time-series data streams is that all backing indexes of a TSDS are sorted by timestamp, and all documents are written into the right backing index for the document's time frame, even if that backing index is not the current write index. This means that if your data source lags (even by a few hours), the data will still land in the right index. Also, all documents related to the same dimension (i.e. customerId in your case) will end up on the same shard.
Another difference is that the ID of each document is computed as a function of the timestamp and the dimension(s) contained in the document, which means there can only be a single occurrence for a given timestamp/dimension pair (i.e. no duplicates).
Technically, you can already achieve pretty much the same with normal data streams; however, the underlying optimizations related to storing docs on the same shard and the ability to write documents to older backing indexes won't be possible, since you can only index documents into the current write index.
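The no-duplicates property can be illustrated with a small sketch. The hash function and key layout below are made up for illustration; ES derives the real TSDS _id internally:

```python
import hashlib

def tsds_doc_id(timestamp: str, dimensions: dict) -> str:
    # Illustration only: this demonstrates the consequence of an _id that is
    # a deterministic function of timestamp + dimension values.
    key = timestamp + "|" + "|".join(f"{k}={v}" for k, v in sorted(dimensions.items()))
    return hashlib.sha1(key.encode()).hexdigest()

a = tsds_doc_id("2022-10-01T12:00:00Z", {"customerId": "42"})
b = tsds_doc_id("2022-10-01T12:00:00Z", {"customerId": "42"})
c = tsds_doc_id("2022-10-01T12:00:00Z", {"customerId": "43"})

assert a == b  # same timestamp/dimension pair -> same _id -> overwrite, not a duplicate
assert a != c  # a different dimension value yields a different _id
```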
I am trying to understand and effectively use the index type feature available in Elasticsearch.
However, I am still not clear how the _type meta field differs from any regular field of an index in terms of storage/implementation. I do understand avoiding_type_gotchas.
For example, if I have 1 million records (say posts) and each post has a creation_date, how will things play out if one of my index types is creation_date itself (leading to ~1 million types)? I don't think it affects the way Lucene stores documents, does it?
In what way will my Elasticsearch query performance be affected if I use creation_date as the index type versus a namesake type, say 'post'?
I got the answer on elastic forum.
https://discuss.elastic.co/t/index-type-effective-utilization/58706
Pasting the response as is -
"While elasticsearch is scalable in many dimensions there is one where it is limited. This is the metadata about your indices which includes the various indices, doc types and fields they contain.
These "mappings" exist in memory and are updated and shared around all nodes with every change. For this reason it does not make sense to endlessly grow the list of indices, types (and therefore fields) that exist in this cluster state. A type-per-document-creation-date registers a million on the one-to-ten scale of bad design decisions" - Mark_Harwood
I googled how to update docs in ES across all the shards of an index if they exist. I found a way (the /_bulk API), but it requires specifying the routing values. I was not able to find a solution to my problem. If anybody is aware of the following, please update me.
Is there any way to update a doc in all the shards of an index, if it exists, using a single update query?
If not, is there any way to generate routing values such that we can hit all shards with the update query?
Ideally, for bulk updates, ES recommends getting the documents that need updating by query, using scan and scroll, updating them, and indexing them again. Internally, ES never updates a document in place, although it provides an Update API through scripting: it always reindexes a new document with the updated field/value and deletes the older one.
Is there any way to update a doc in all the shards of an index, if it exists, using a single update query?
You can check the update API to see if it suits your purpose. There are also plugins which can provide update by query. Check this.
Now comes the routing part and updating all shards. If you specified a routing value when indexing the document for the very first time, then whenever you update that document you need to pass the same original routing value. Otherwise ES has no way of knowing which shard the document resides on, and the update could be sent to any shard (based on the routing algorithm).
If you don't use a routing value, then ES uses an algorithm based on the ID of the document to decide which shard it goes to. Hence, when you update a document through the bulk API and keep the same ID without routing, the document will be saved on the same shard as before, and you will see the update.
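As a sketch of what such a bulk update body looks like, here is a small Python helper that builds the NDJSON for the _bulk API with each document's original routing value attached (the index name, ids and routing values are invented for the example):

```python
import json

def bulk_update_lines(index, updates):
    """Build an NDJSON body for the _bulk API. Each update in `updates` is a
    (doc_id, routing, partial_doc) tuple; the routing value is attached to the
    action metadata so the update reaches the shard the document lives on."""
    lines = []
    for doc_id, routing, partial_doc in updates:
        lines.append(json.dumps({"update": {"_index": index, "_id": doc_id, "routing": routing}}))
        lines.append(json.dumps({"doc": partial_doc}))
    # _bulk request bodies must end with a trailing newline
    return "\n".join(lines) + "\n"

body = bulk_update_lines("posts", [("1", "123", {"title": "updated"})])
print(body)
```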
By routing we can allocate a particular file/doc/JSON document to a particular shard, which makes it easy to extract data.
But I am wondering whether it would be possible to store a particular field of a JSON document in a particular shard.
For example:
I have three fields: username, message and time. I have created 3 shards for indexing.
Now I want
username stored in one shard, message in another shard, and time in another shard.
Thanks
No, this is not possible. The whole document (the JSON doc) will be stored on one shard. If you want to do what you describe, you should split the data up into separate docs, and then you can route them differently.
As for the reasoning, imagine there was a username query which matched document5. If document5 were spread over many shards, all of them would have to be queried to get the other parts of document5 back to compile the results. Imagine further a complex AND query across different fields: there would be a lot of traffic (and waiting) just to find out whether both fields match and the document is a hit.
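If you do go the split-into-separate-docs route, a minimal sketch of the transformation could look like this (the field names come from the question; the per-field routing choice and the parent_id field are assumptions for illustration):

```python
def split_by_field(doc_id, doc):
    # Turn one {username, message, time} document into three field-level
    # documents, each with its own routing value, so each field can be
    # directed to a different shard. parent_id links the pieces back together.
    return [
        {"_id": f"{doc_id}-{field}", "routing": field, "parent_id": doc_id, field: value}
        for field, value in doc.items()
    ]

docs = split_by_field("5", {"username": "alice", "message": "hi", "time": "12:00"})
```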
I'm running a typical logstash-redis-elasticsearch stack to capture all my logs (around 500 GB/day). To my knowledge, Elasticsearch queries every shard in an index and aggregates the results. But given the volume of logs per day and the response times needed, I want to query only a few shards, decided by some "tag" in the message. So I'm looking for a way to allocate data to shards based on tags and to query only the relevant shards based on those tags. Any leads, references or solutions on how to achieve this?
I've already looked at shard allocation filtering, but that doesn't cater to this specific requirement.
Routing is the way to go here.
Specifying a routing value while indexing will cause the document to be routed to a specific shard. See routing in the index API.
You can also extract the routing value from a field. See the routing field.
Don't forget to search with the same routing value. See the routing option in search.
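Putting those steps together, here is a small Python simulation of tag-based routing (md5 stands in for ES's murmur3 routing hash, and the in-memory "shards" only illustrate that indexing and searching with the same routing value touch a single shard):

```python
import hashlib
from collections import defaultdict

NUM_SHARDS = 5
shards = defaultdict(list)  # simulated index: shard number -> list of docs

def shard_for(routing):
    # md5 stands in for the murmur3 hash ES uses: hash(routing) % num_primary_shards
    return int.from_bytes(hashlib.md5(routing.encode()).digest()[:4], "big") % NUM_SHARDS

def index_log(doc):
    # Route each log line by its tag, mirroring `PUT logs/_doc/<id>?routing=<tag>`.
    shards[shard_for(doc["tag"])].append(doc)

def search_by_tag(tag):
    # A search with `?routing=<tag>` only touches the one shard the tag maps to.
    return [d for d in shards[shard_for(tag)] if d["tag"] == tag]

index_log({"tag": "payments", "message": "charge ok"})
index_log({"tag": "auth", "message": "login failed"})
assert search_by_tag("payments") == [{"tag": "payments", "message": "charge ok"}]
```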