ElasticSearch: mappings for fields that are sorted often - elasticsearch

Suppose I have a field "epoch_date" that will be sorted often when I do Elastic Search queries. How should I map this field? Right now, I just have stored: yes. Should I index it even though this field will not count towards the relevancy scoring? What should I add to this field if I intend to sort on this field often, so it will be more efficient?
{
"tweet" : {
"properties" : {
"epoch_date" : {
"type" : "integer",
"store" : "yes"
}
}
}
}

There's nothing you need to change to sort on the field given your mapping. You can only sort on a field if it's indexed, and the default is "index":"yes" for numeric or dates. You can not set a numeric type to analyzed, since there's no text to analyze. Also, better to use the date type for a date instead of the integer.
Sorting can be memory expensive if your field you are sorting on has a lot of unique terms. Just make sure you have enough memory for it. Also, keep in mind that sorting on a specific field you throw away the relevance ranking, which is a big part of what a search engine is all about.
Whether you want to store the field too doesn't have anything to do with sorting, but just with the way you retrieve it in order to return it together with your search results. If you use the _source field (default behaviour) there's no reason to store specific fields. If you ask for specific fields using the fields option when querying, then the stored fields would be retrieved directly from lucene rather than extracted from the _source field parsing the json.

An index is used for efficient sorting. So YES, you want to create an index for the field.
As to needing it to be "more efficient", I'd kindly advise you to first check your results and see if they're fast enough. I don't see a reason beforehand (with the limited info you provided) to think it wouldn't be efficient.
If you intend to filter on the field as well (date-ranges?) be sure to use filters instead of queries whenever you feel the filters used will be used often. This because filters can be efficiently cached.

Related

analyzed field vs doc_values: true field

We have an elasticsearch that contains over half a billion documents that each have a url field that stores a URL.
The url field mapping currently has the settings:
{
index: not_analyzed
doc_values: true
...
}
We want our users to be able to search URLs, or portions of URLs without having to use wildcards.
For example, taking the URL with path: /part1/user#site/part2/part3.ext
They should be able to bring back a matching document by searching:
part3.ext
user#site
part1
part2/part3.ext
The way I see it, we have two options:
Implement an analysed version of this field (which can no longer have doc_values: true) and do match querying instead of wildcards. This would also require using a custom analyser to leverage the pattern tokeniser to make the extracted terms correct (the standard tokeniser would split user#site into user and site).
Go through our database and for each document create a new field that is a list of URL parts. This field could have doc_values: true still so would be stored off-heap, and we could do term querying on exact field values instead of wildcards.
My question is this:
Which is better for performance: having a list of variable lengths that has doc_values on, or having an analysed field? (ie: option 1 or option 2) OR is there an option 3 that would be even better yet?!
Thanks for your help!
Your question is about a field where you need doc_values but can not index with keyword-analyzer.
You did not mention why you need doc_values. But you did mention that you currently not search in this field.
So I guess that the name of the search-field do not have to be the same: you can copy the field value in an other field which is only for search ( "store": false ). For this new field you can use the pattern-analyzer or pattern-tokenizer for your use case.
It seems that no-one has actually performance tested the two options, so I did.
I took a sample of 10 million documents and created two new indices:
An index with an analysed field that was setup as suggested in the other answer.
An index with a string field that would store all permutations of URL segmentation.
I ran an enrichment process over the second index to populate the fields. The field values on the first index were created when I re-indexed the sample data from my main index.
Then I created a set of gatling tests to run against the indices and compared the gatling results and netdata (https://github.com/firehol/netdata) landscape for each.
The results were as follows:
Regarding the netadata landscape: The analysed field showed a spike - although only a small one - on all elastic nodes. The not_analysed list field tests didn't even register.
It is worth mentioning that enriching the list field with URL segmentation permutations bloated the index by about 80% in our case. So there's a trade off - you never need to do wildcard searches for exact sub-segment matching on URLs, but you'll need a lot more disk to do it.
Update
Don't do this. Go for doc_values. Doing anything with analyzed strings that have a massive number of possible terms will mean massive field data that will, eventually, never fit in the amount of memory you can allocate it.

How to change field value in elasticsearch from string to integer?

I have some data indexed in elasticsearch, in _source I have a field to store file size:
{"file_size":"25.2MB"}
{"file_size":"2GB"}
{"file_size":"800KB"}
Currently the mapping of this field is string. I want to do search with sorting by file_size. I guess I need change the mapping to integer and do re-index.
How can I calculate the size in bytes and re-index them as integer?
Elasticsearch does not support field reindexing, as documents in lucenes index is immutable. So, internally, every document need to be fetched, changed, indexed back to index and old copy should be removed. Its doesn't matter what you actually need - change mapping or change data.
So, about practical part. Straightforward way:
Create new index with proper mapping
Fetch all your documents from old index
Change your file_size field to integer according to any logic you need
Index documents to new index
Drop old index after full migration
So, application side will contain additional logic to transform data from human-readable strings to Long + standard ES driver functionality. To speed this process up, consider using scroll-scan for read and bulk api for write. For future, I recommend using aliases to be able to migrate your data seamlessly.
In case, when you can't do server-side changes for some reason, you can potentially add new field with proper mapping and fire up ES-side updates with scripted partial updates (). Or try your luck with experimental plugin
why not use sort by keyword?
just add this:
{
"sort": {
"file_size.keyword": {
"order": "asc"
}
}
}
it was only sort it by string, so if there is data 2.5GB, 1KB, 5KB, the data will be 1KB, 2.5GB, 5KB
i think you have to save it into Byte first, so you can easily sorting it if it was in the same format.

Cost of adding field mapping in elasticsearch type

I have a use-case, where I have got a set of predefined fields and also need to support adding dynamic fields to ElasticSearch with some basic searching on them. I am able to achieve this using dynamic template mapping. However, the frequency of adding such dynamic fields is quite high.
Consider the this ES document for the Event type:
{
"name":"Youth Conference",
"venue":"Ahmedabad",
"date":"10/01/2015",
"organizer":"Invincible",
"extensions":{
"about": {
"vision":"Visualizes the image of an ideal Country. ",
"mission":"Encapsulates the gravity of the top reformative solutions for betterment of Country."
}
// Any thing can go here..
}
}
In the example above, each event document may have any unknown/new fields. Hence, for every such new dynamic field introduced, ES will update the mapping of the type. My concern is what is the cost of adding new field mapping in the existing type?
I am planning to separate out all dynamic mappings(inside extensions) from Event type by introducing another type, say EventExtensions and using parent/child relationship to map it with Event type. I believe this may limit the cost(if any) of adding dynamic fields frequently to the type. However, to my knowledge, using parent/child relationship will need more memory.
The first thing to remember here is that field is per index and not per type.
So wherever you add new fields , it would be made in the same index. Be it , in another type or as parent or child.
So decoupling the new fields to another type but same index is not going to make any change.
Second field addition is not that very expensive thing. I know people who uses 1000 of fields and are good with it. That being said , there should be a tab on number of field so that it wont go out to crazy numbers.
Here we have multiple approaches to solve the problem
1) Lets assume that the new field data need not be exactly searchable. In this case , you can deserialize the entire JSON as a string and add it to a field. Also make sure this field is not indexed. This way you can search based on other fields but then on retrieval of the document , get the information that was deserialized.
2) Lets say the new field looks like this
{
"newInfo1" : "log Of Info",
"newInfo2" : "A lot more info"
}
Instead of this , you can use
{
"newInfo" : [
{
"fieldName" : "newInfo1",
"fieldValue" : "log Of Info"
},
{
"fieldName" : "newInfo2",
"fieldValue" : "A lot more info"
}
]
}
This way , fields wont increase. But then to make field level specific search , like give me all documents with filedName as newInfo2 and having the word more in it , you will need to make newInfo field nested.
Hope this helps.

Is there a need for a boolean field index in elasticsearch

I am trying to optimize my elasticsearch.
I have several boolean fields, which I use queries with.
I could dispense with them, but that would give my client side a hard time.
My question is whether or not setting those fields to "index":"yes" will actually have a significant negative effect on my index's performance, such as indexing time and size (other than the obvious "store" space it would take)?
Does a boolean indexed field really take up more space? It seems it shouldn't. Moreover, I don't see any benefit in creating such an index for any DB, not only elasticsearch.
But, I have to specify "index":"yes" to be able to filter by it, right?
If you want to search against a field you have to index it. By default a boolean field is indexed, and will take a small amount of space to do so. There will be a list of docs where "myfield": true and "myfield": false.
If you didn't want to maintain this index, then when you wanted to find docs where "myfield": true you would have to through every doc to check the field.
If you don't want to search/filter with that field, by all means set "index": "no", just be warned you will need to re-index everything if you change your mind about this field in the future!
Have a look at the elasticsearch docs on mappings; the core types section, scroll down to the boolean type.

What indexes are created when indexing a document in elasticsearch

If I create a first document of it's type, or put a mapping, is an index created for each field?
Obviously if i set "index" to "analyzed" or "not analyzed" the field is indexed.
Is there a way to store a field so it can be retrieved but never searched by? I imagine this will save a lot of space? If I set this to "no" will this save space?
Will I still be able to search by this, just take more time, or will this be totally unsearchable?
Is there a way to make a field indexed after some documents are inserted and I change my mind?
For example, I might have a mapping:
{
"book":{"properties":{
"title":{"type":"string", "index":"not_analyzed"},
"shelf":{"type":"long","index":"no"}
}}}
so I want to be able to search by title, but also retrieve the shelf the book is on
index:no will indeed not create an index for that field, so that saves some space. Once you've done that you can't search for that particular field anymore.
Perhaps also useful in this context is to know aboutthe _source field, which is returned by default and includes all fields you've stored. http://www.elasticsearch.org/guide/reference/mapping/source-field/
As to your second question:
you can't change your mind halfway. When you want to index a particular field later on you have to reindex the documents.
That's why you may want to reconsider setting index:no, etc. In fact a good strategy to begin is to don't define a schema for fields at all, unless you're 100% sure you need a non-default analyzer for a particular field for instance. Otherwise ES will use generally usable defaults.

Resources