Max length of Elasticsearch field names - elasticsearch

Is there any restriction on the length of field names in JSON documents stored in Elasticsearch?
The maximum length of an Elasticsearch index name is 255 bytes. But I didn't find any restrictions on field names.

Yes, for index names it's currently set to 255.
I couldn't find a limit for field names in a quick search through the source code. I also tried a field name of more than 6,000 characters and that worked without problems. So I guess for regular use cases you should be OK ;-)

Related

How to calculate Elasticsearch Field Size

The question: is there a way to find the most expensive field in an Elasticsearch index?
The aim is to calculate and compare the storage and index size of two fields in an Elasticsearch index.
Also, is it wise to use dual-type fields?
For example, a string in Elasticsearch has a text field, which is searchable, and a .keyword field, which is aggregatable.
Will it use double the storage and index space?
Is it wise to use dual-type fields, like a string in Elasticsearch that has a text field which is searchable and a .keyword field which is aggregatable?
It depends entirely on the use case. Maintain both the keyword and text representations of a field value if:
a) You need advanced search capability on the field
b) Your current or future requirements include sorting or aggregating on the field.
In practice, I have seen that for short text fields like 'name', 'business-name', 'tag', etc. it makes sense to maintain both. But for larger texts, e.g. a description, I don't think there are use cases for aggregation and sorting (in general).
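As a rough illustration of the dual-type setup discussed above, here is a minimal sketch of a mapping written as a Python dict. The field names "name" and "description" are only examples, and the typeless "mappings" layout is an assumption about the Elasticsearch version in use.

```python
import json

# Hypothetical index body: "name" is searchable as text and also has a
# keyword sub-field (name.keyword) for sorting and aggregations.
# "description" is kept as text only, since long texts are rarely aggregated.
index_body = {
    "mappings": {
        "properties": {
            "name": {
                "type": "text",
                "fields": {
                    "keyword": {"type": "keyword"},
                },
            },
            "description": {"type": "text"},
        }
    }
}

# The JSON that would be sent when creating the index.
print(json.dumps(index_body, indent=2))
```

The original value is kept once in _source, but each sub-field builds its own index structures, so a dual-type field does cost extra index space compared with a single type.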

ElasticSearch and Searching in Arrays

We have an ES index with a field that stores its data as an array. In this field, we include the original text, plus the text without any punctuation, special characters, etc. The problem is that, when searching on the field, the multiple values appear to be skewing the score.
For example, if we search on the term 'up', the document which has the array ['up, up and away', 'up up and away'] scores higher with a multi_match (which we are using because we may search more than one field) than the document whose array is simply ['up'].
In the end, I guess what I am looking for is a score that emulates calculating a score for each item in the array and returning me the highest. I believe in this case, comparing 'up' to 'Up' and 'Up, Up and Away' will give me a higher score for 'Up'.
With my research, I believe I may need to do custom scoring on this field...? If that is true, am I looking at "score_mode": "max" as what I want?
I think you slightly over-engineered your index. You don't need to create duplicate values for the same information, or to remove punctuation and lowercase the text yourself.
I'd recommend reading up on Elasticsearch token filters and on how to create multiple analyzers for the same field.
For your exact use case, a sample document would certainly help. But in any case, looking at what you are dealing with: index your array of strings with the default analyzer and also with a custom one that you build yourself. Then you can use the same field, but with different analyzers (differently processed text), to control your score.
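The suggestion above, one field indexed with two analyzers, could be sketched roughly as follows. The field name "titles", the sub-field name "folded" and the analyzer name "folded_text" are made up for illustration. The filter chain is only one possible choice, and the typeless mapping layout is an assumption about the Elasticsearch version.

```python
# Hypothetical index body: "titles" uses the default analyzer, while the
# sub-field "titles.folded" uses a custom analyzer built from standard parts.
index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                "folded_text": {
                    "type": "custom",
                    "tokenizer": "standard",  # drops most punctuation
                    "filter": ["lowercase", "asciifolding"],
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "titles": {
                "type": "text",
                "fields": {
                    "folded": {"type": "text", "analyzer": "folded_text"},
                },
            }
        }
    },
}

# A multi_match over both views of the same field. With best_fields the
# score comes from the best matching field.
search_body = {
    "query": {
        "multi_match": {
            "query": "up",
            "fields": ["titles", "titles.folded"],
            "type": "best_fields",
        }
    }
}
```

The document then only needs the original strings in the array. Note that this does not by itself score each array item separately.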

Elasticsearch query on string representation of number

Good day:
I have an indexed field called amount, which is of string type. The value of amount can be either one or 1. Say, in this example, we have amount=1 in an indexed document. When I search for one, Elasticsearch will not return the document unless I put 1 in the search query. Any thoughts on how I can get this to work? I'm thinking a tokenizer is what's needed.
Thanks.
You probably don't want this for sevenmillionfourhundredfifteenthousandtwohundredfourteen and the like, but only for a small number of values.
At index time I would convert everything to a proper number and store it in a numerical field, which then even allows sorting, if you need it. Apart from this, I would use synonyms at index and at query time and map everything to the digit strings, but in a general text field that is searched by default.
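Both ideas above can be sketched in a few lines. This is only an illustration under assumptions: the word list is a tiny hand-picked sample, and the names normalize_amount, number_words, amount_text and all_text are hypothetical.

```python
# Small, hand-maintained table of number words (assumed to cover only the
# handful of values that actually occur).
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3}


def normalize_amount(raw):
    """Convert 'one' or '1' to the integer 1 before indexing.

    Raises ValueError for anything that is neither a known word nor digits.
    """
    text = raw.strip().lower()
    if text in NUMBER_WORDS:
        return NUMBER_WORDS[text]
    return int(text)


# Synonym rules in the "word => digits" form, generated from the same table.
synonym_rules = [f"{word} => {number}" for word, number in NUMBER_WORDS.items()]

# Hypothetical index body: a numeric "amount" field for sorting, plus a
# general text field whose analyzer maps number words to digit strings.
index_body = {
    "settings": {
        "analysis": {
            "filter": {
                "number_words": {"type": "synonym", "synonyms": synonym_rules}
            },
            "analyzer": {
                "amount_text": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "number_words"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "amount": {"type": "integer"},
            "all_text": {"type": "text", "analyzer": "amount_text"},
        }
    },
}
```

Because the same analyzer runs at index and query time, a search for one and a search for 1 would both end up looking for the token 1 in the text field.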

Get top 10 most used words in text fields

I have an index containing thousands of documents, each one of them having a full text field.
I want to search through all those fields and fetch the 10 words that occur most often.
I would also like a way of visualizing it on Kibana if that's possible.
The most common way to achieve that is to duplicate your full text field with a keyword datatype. That lets you run a terms aggregation on that field. You could also consider a significant terms aggregation, to avoid stopwords and common words. In ES 6.x you could also use the significant text aggregation, without creating the keyword field, but I have never tried it and don't know how it works. If instead you need the frequency of the words for each document, you should use the term vectors API.
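A request body for the terms aggregation mentioned above might look like the following sketch. The helper name top_terms_request, the aggregation name "top_terms" and the field name "body.keyword" are assumptions for illustration.

```python
def top_terms_request(field, size=10):
    """Build a search body that returns only the most frequent terms of a field."""
    return {
        "size": 0,  # no hits needed, only the aggregation result
        "aggs": {
            "top_terms": {
                "terms": {"field": field, "size": size},
            }
        },
    }


# Hypothetical keyword sub-field of the full text field.
request_body = top_terms_request("body.keyword")
```

One caveat: a keyword field is indexed as a single term, so this counts whole field values. Counting individual words needs an aggregation over a tokenized field instead.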

analyzed field vs doc_values: true field

We have an Elasticsearch cluster that contains over half a billion documents, each of which has a url field that stores a URL.
The url field mapping currently has the settings:
{
index: not_analyzed
doc_values: true
...
}
We want our users to be able to search URLs, or portions of URLs without having to use wildcards.
For example, taking the URL with path: /part1/user#site/part2/part3.ext
They should be able to bring back a matching document by searching:
part3.ext
user#site
part1
part2/part3.ext
The way I see it, we have two options:
1. Implement an analysed version of this field (which can no longer have doc_values: true) and do match querying instead of wildcards. This would also require using a custom analyser to leverage the pattern tokeniser to make the extracted terms correct (the standard tokeniser would split user#site into user and site).
2. Go through our database and for each document create a new field that is a list of URL parts. This field could have doc_values: true still so would be stored off-heap, and we could do term querying on exact field values instead of wildcards.
My question is this:
Which is better for performance: having a list of variable lengths that has doc_values on, or having an analysed field? (ie: option 1 or option 2) OR is there an option 3 that would be even better yet?!
Thanks for your help!
Your question is about a field where you need doc_values but cannot index with the keyword analyzer.
You did not mention why you need doc_values, but you did mention that you currently do not search in this field.
So I guess the name of the search field does not have to be the same: you can copy the field value into another field which is only for search ("store": false). For this new field you can use the pattern analyzer or pattern tokenizer for your use case.
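A rough sketch of that idea follows. The names url_search, url_parts and url_analyzer are hypothetical. The keyword/text types and typeless mapping layout are assumptions that fit newer Elasticsearch versions than the not_analyzed syntax in the question, so adapt them to your version.

```python
# Hypothetical index body: "url" stays an exact-value field with doc_values,
# and its value is copied into "url_search", which is analysed for searching.
index_body = {
    "settings": {
        "analysis": {
            "tokenizer": {
                # Split only on "/", so "user#site" stays one token.
                "url_parts": {"type": "pattern", "pattern": "/"}
            },
            "analyzer": {
                "url_analyzer": {
                    "type": "custom",
                    "tokenizer": "url_parts",
                    "filter": ["lowercase"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "url": {
                "type": "keyword",
                "doc_values": True,
                "copy_to": "url_search",
            },
            "url_search": {
                "type": "text",
                "analyzer": "url_analyzer",
                "store": False,
            },
        }
    },
}

# A match query against the search-only field; the query text goes through
# the same analyzer, so "part2/part3.ext" becomes the tokens part2, part3.ext.
search_body = {
    "query": {
        "match": {
            "url_search": {"query": "part2/part3.ext", "operator": "and"}
        }
    }
}
```

A match with "operator": "and" only requires both tokens to be present; a match_phrase query would additionally require them to be adjacent and in order.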
It seems that no-one has actually performance tested the two options, so I did.
I took a sample of 10 million documents and created two new indices:
1. An index with an analysed field that was set up as suggested in the other answer.
2. An index with a string field that would store all permutations of URL segmentation.
I ran an enrichment process over the second index to populate the fields. The field values on the first index were created when I re-indexed the sample data from my main index.
Then I created a set of Gatling tests to run against the indices and compared the Gatling results and netdata (https://github.com/firehol/netdata) landscape for each.
The results were as follows:
Regarding the netdata landscape: the analysed field showed a spike, although only a small one, on all elastic nodes. The not_analysed list field tests didn't even register.
It is worth mentioning that enriching the list field with URL segmentation permutations bloated the index by about 80% in our case. So there's a trade-off: you never need to do wildcard searches for exact sub-segment matching on URLs, but you'll need a lot more disk to do it.
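For reference, the enrichment step for the list field could look something like the sketch below. It assumes that "permutations of URL segmentation" means every contiguous run of path segments, which is my reading and not necessarily the exact process used in the test; the function name url_sub_paths is made up.

```python
def url_sub_paths(path):
    """Return every contiguous run of path segments, joined with '/'."""
    segments = [segment for segment in path.split("/") if segment]
    parts = []
    for start in range(len(segments)):
        for end in range(start + 1, len(segments) + 1):
            parts.append("/".join(segments[start:end]))
    return parts


# The example path from the question yields all four searches listed there,
# e.g. "part3.ext", "user#site", "part1" and "part2/part3.ext".
parts = url_sub_paths("/part1/user#site/part2/part3.ext")
```

A path with n segments produces n(n+1)/2 values under this reading, which gives a feel for why the list field adds so much to the index size.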
Update
Don't do this. Go for doc_values. Doing anything with analyzed strings that have a massive number of possible terms will mean massive field data that will eventually no longer fit in the amount of memory you can allocate to it.
