Storing a particular field in a particular shard in Elasticsearch

With routing we can allocate a particular file/doc/JSON document to a particular shard, which makes it easy to extract data.
But would it be possible to store a particular field of a JSON document in a particular shard?
For example:
I have three fields: username, message and time. I have created 3 shards for indexing.
Now I want
username to be stored in one shard, message in another shard and time in another shard.
Thanks

No, this is not possible. The whole document (the JSON doc) will be stored on one shard. If you want to do what you describe, you should split the data up into separate docs, and then you can route them differently.
As for the reasoning, imagine there was a username query which matched document5. If document5 were spread over many shards, all of those shards would have to be queried to get the other parts of document5 back to compile the results. Imagine further a complex AND query across different fields: there would be a lot of traffic (and waiting) to find out whether both fields match and decide if the document was a hit or not.
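For illustration, a rough sketch of that "split into separate docs" approach with the Python client (the chat index, field names, and the 8.x-style document= argument are assumptions, not anything from the question):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# One logical record, split into one document per field (names are made up).
record_id = "doc5"
record = {"username": "alice", "message": "hello world", "time": "2023-01-01T00:00:00Z"}

for field, value in record.items():
    es.index(
        index="chat",        # assumed index with 3 primary shards
        routing=field,       # routing picks the shard via a hash, so distinct
                             # routing values can still land on the same shard
        document={"record_id": record_id, "field": field, "value": value},
    )
```

The trade-off described above still applies: any query that needs more than one field has to fan out to several shards and be re-joined by record_id in the application.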

Related

Elastic Search Monthly Rolling index with custom routing

I am trying to figure out how to create a monthly rolling index with custom routing (multi-tenancy scenario), with these requirements:
WRITE flow: Each document will have a timestamp, and the document should be indexed into the appropriate backing index based on that timestamp, not into the latest index. Also, write requests will have a custom routing key (e.g. customerId) so they hit a specific shard.
READ flow: Requests must be routed to all backing indexes. Requests will have a custom routing key specified (e.g. customerId) and results must be aggregated and returned.
Index creation: Rolling the index should be automated. Each index should have a custom routing key (e.g. customerId).
Wondering what options are available?
This very feature, called time-series data streams (TSDS), will be coming in the upcoming ES 8.5 release.
The big difference between normal data streams and time-series data streams is that all backing indexes of a TSDS are sorted by timestamp, and each document is written to the right backing index for its time frame, even if that backing index is not the current write index. This means that if your data source lags (even by a few hours), the data will still land in the right index. Also, all documents related to the same dimension (i.e. customerId in your case) will end up on the same shard.
Another difference is that the ID of each document is computed as a function of the timestamp and the dimension(s) contained in the document, which means there can only be a single occurrence for a given timestamp/dimension pair (i.e. no duplicates).
Technically, you can already achieve pretty much the same with normal data streams; however, the underlying optimizations related to storing docs on the same shard and the ability to write documents to older backing indexes won't be possible, since you can only index documents into the current write index.
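As a rough sketch of what such a TSDS setup could look like (template name, field names, and the 8.x Python client keyword arguments are assumptions):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Index template for a time-series data stream: customerId is declared as a
# dimension, so documents for the same customer end up on the same shard, and
# shard routing is driven by index.routing_path rather than a per-request key.
es.indices.put_index_template(
    name="customer-events",                      # assumed template name
    index_patterns=["customer-events*"],
    data_stream={},
    template={
        "settings": {
            "index.mode": "time_series",
            "index.routing_path": ["customerId"],
        },
        "mappings": {
            "properties": {
                "@timestamp": {"type": "date"},
                "customerId": {"type": "keyword", "time_series_dimension": True},
                "responseTime": {"type": "double", "time_series_metric": "gauge"},
            }
        },
    },
)
```

Reads then simply target the data stream name, which fans out to all backing indexes, which covers the READ flow above.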

Elasticsearch multiple index request on same document id in bulk api

We are using Elasticsearch 6.0 and use bulk indexing to index many documents in a single request using the “index” action. In a single request we can have a scenario where there are multiple “index” requests for the same document. Will ES fail the bulk request in that case, or will it process all of them in order?
Edit 1: I use a script for indexing in the bulk request, where we handle out-of-order updates. So as long as all “index” requests get processed, we don’t have any issue.
ES will not fail, but it is not necessarily clear which indexing operation will "win". It might be the last one, but since all operations in the bulk batch might be spread over several ingest nodes, and not all of those nodes process the indexing operations at the same rate, it might not be clear which operation will be processed first and which will be processed last.
The only guarantee that you have is that in the response, you'll get the state of each operation in the same order as specified in the request batch.
If your index has only one primary shard, then the order in which you submit the operations is also the order in which they are processed, hence the last one wins; but if you have more than one primary shard on more than one node, then you can't really know.
A better question would be: why do you submit several indexing operations per document, knowing in advance that only one will win?
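To make the ordering guarantee concrete, here is a sketch with the 8.x Python client (index name, field, and ids are made up): two "index" actions for the same _id in one bulk call, with the per-item results read back in request order.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Two index actions targeting the same _id in a single bulk request.
resp = es.bulk(operations=[
    {"index": {"_index": "events", "_id": "42"}},
    {"payload": "first version"},
    {"index": {"_index": "events", "_id": "42"}},
    {"payload": "second version"},
])

# The items come back in the same order as the request batch; which version
# ends up stored is only deterministic when the index has a single primary shard.
for item in resp["items"]:
    print(item["index"]["_id"], item["index"]["result"])
```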

Index type in elasticsearch

I am trying to understand and effectively use the index type available in elasticsearch.
However, I am still not clear on how the _type meta field is different from any regular field of an index in terms of storage/implementation. I do understand avoiding type gotchas.
For example, say I have 1 million records (posts) and each post has a creation_date. How will things play out if one of my index types is creation_date itself (leading to ~1 million types)? I don't think it affects the way Lucene stores documents, does it?
In what way will my Elasticsearch query performance be affected if I use creation_date as the index type, compared with a single namesake type, say 'post'?
I got the answer on elastic forum.
https://discuss.elastic.co/t/index-type-effective-utilization/58706
Pasting the response as is -
"While elasticsearch is scalable in many dimensions there is one where it is limited. This is the metadata about your indices which includes the various indices, doc types and fields they contain.
These "mappings" exist in memory and are updated and shared around all nodes with every change. For this reason it does not make sense to endlessly grow the list of indices, types (and therefore fields) that exist in this cluster state. A type-per-document-creation-date registers a million on the one-to-ten scale of bad design decisions" - Mark_Harwood

Elasticsearch get multiple documents by uids over multiple indices

The previous setup was that all documents of one type were in the same index. But due to conceptually different forms of types, and for backup purposes, I need multiple indices of a single type.
They will all be of the form _feed. While this setup is great in some circumstances, for
client.prepareGet(index, typename, ids).execute().actionGet(); // works great if you know in which index to search
it is useless, since no wildcards may be used. What I can do is use multiple multigets and interleave the results. This gives what I want, but increases the number of queries significantly.
Assuming I know, for sure, that only one document exists for a given id, is there a better way to query those than calling a multiget on all _uids for each possible index?
The best way would be to develop a mechanism in your application that would allow you to deduce the index name from the id. But assuming that this is not possible or practical, you have pretty much only two choices. If you need realtime get, then your approach is the only way to do it. If realtime get is not a requirement, you can perform a search across all indices using an ids filter. If the id list is small, you can benefit from using routing on your search query. This way the search request will only be dispatched to the shards that might contain any of the ids listed in the query. However, if the list of ids is big enough to span most of the shards, it will not provide any benefit.
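If realtime get is not needed, the search-based approach could look roughly like this with the Python client (the *_feed pattern and ids are assumptions; passing the ids as routing values only helps while the list is small):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

doc_ids = ["user42", "user99"]                   # assumed document ids

# One search across all *_feed indices instead of a multiget per index.
# Routing on the ids restricts the fan-out to the shards those ids hash to.
resp = es.search(
    index="*_feed",
    routing=",".join(doc_ids),
    query={"ids": {"values": doc_ids}},
)

for hit in resp["hits"]["hits"]:
    print(hit["_index"], hit["_id"])
```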

Query and allocate data to shards based on tags

I'm running a typical logstash-redis-elasticsearch system to capture all my logs (around 500 GB/day). To my knowledge, Elasticsearch queries every shard in an index and aggregates the results, but due to the volume of logs per day and the response times needed, I want to query only a few shards, which of course should be decided by some "tag" in the message. So I'm looking for a way to allocate data to shards based on some tags and query only the relevant shards based on those tags. Any leads, references or solutions on how to achieve this?
I've already looked at shard allocation filtering but that doesn't cater this specific requirement.
Routing is the way to go here.
Specifying a routing option while indexing will cause the document to be routed to a specific shard. See routing in the index API.
You can also extract the routing value from a field. See the routing field.
Don't forget to search with the same routing value. See the routing option in search.
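A minimal sketch of that index-then-search-with-the-same-routing flow, using the Python client (index name, tag value, and the 8.x keyword arguments are assumptions):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Index a log line routed by its tag...
es.index(
    index="logs-2023.01.01",                     # assumed daily index
    routing="payments",                          # the "tag" used as routing value
    document={"tag": "payments", "message": "order 123 captured"},
)

# ...and search with the same routing value, so only the shard(s) holding
# documents routed with "payments" are queried instead of every shard.
resp = es.search(
    index="logs-2023.01.01",
    routing="payments",
    query={"match": {"message": "order"}},
)
print(resp["hits"]["total"])
```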
