What does it mean when it appears <propertie>.keyword in Elasticsearch? - elasticsearch

I am a beginner in the world of Elasticsearch and And I don't know what it means when a property called .keyword. It only appears when I'm in the section "Management" > "Index Pattern".
Only those properties (propertie.keyword) have option 'aggregatable' active.
What's the difference between 'locality' and 'locality.keyword'.
And I don't get the same result when I do
{'match': {'locality': "Sant Climent"}}
or
{'match': {'locality.keyword': "Sant Climent"}}
Someone could explain to me the difference and what each thing is used for? I'm going crazy.
(I'm using the latest version of Elasticsearch BTW, 6.X).

Your field is indexed two times: one time locality is indexed with a text datatype and is used to perform fulltext search. Your text is subvided into tokens and you can apply a filter transformation and retrieve every single word. E.G. Once you have tokenized a text you can apply a stopword list to remove unmeaningful words or a stemmer, giving the possibility to retrieve a word from its stem. You could query against this field with the match query. locality.keyword is the same content indexed as keyword datatype. That it means that your text is considered as a unique token and you can retrieve your content only with a literal search on that field - if you don't provide a keyword normalizer is also case sensitive!, no fulltext. You could query against this field with the term query. For that reason you could make aggregation only on the keyword field content: the server could make a kind of groupby on the field, because its content is as is it, and is not composed by differents units - the tokens.If you have any other doubt , please ask for help :-)

You need to understand the text datatype and keyword datatype.
You can map fields as text and keyword combined like this:
{
"properties":{
"locality":{
"type":"text",
"fields":{
"keyword":{
"type":"keyword",
"ignore_above":256
}
}
},
... other fields here
}
When doing this, you can now query the field locality as text OR you could use locality.keyword to query it as keyword.

Related

Type of field for prefix search in Elastic Search

I'm confused on what index type I should apply for my field for prefix search, many show search_as_you_type but I think auto complete is not what I'm going for.
I have a UUID field:
id: 34y72ca1-3739-41ff-bbec-f6d17479384c
The following terms should return the doc above:
3
34
34y72ca1
34y72ca1-3739
34y72ca1-3739-41ff-bbec-f6d17479384c
Using 3739 should not return it as it doesn't start with 3739. Initially this is what I was going for but then the wildcard field is not supported by Amazon AWS, so I compromise for prefix search instead of partial search.
I tried search_as_you_type field but it doesn't return the result when I use the whole ID. Actually, my use case is when user click enter, the results will be shown, instead of real-live when they type, so if speed is compromised its OK, just that I hope for something that will be good for many rows of data.
Thanks
If you have not explicitly defined any index mapping, then you need to use id.keyword field instead of the id field for the prefix query to show the appropriate results. This uses the keyword analyzer instead of the standard analyzer
{
"query": {
"prefix": {
"id.keyword": {
"value": "34y72ca1"
}
}
}
}
Otherwise, you can modify your index mapping, by adding multi fields for id field

Elastic search wildcard search space issue

Consider index field "ProductName" having the value "dove 3.75oz" and when user searches for "dove 3.75oz" text below bool query is working fine to retreive the document:
{"bool":{"must":[{"wildcard":{"ProductName":{"value":"dove"}}},{"wildcard":{"ProductName":{"value":"3.75oz"}}}]}}
If user searches for "dove 3.75 oz" (Space between "3.75" and "oz") the bool query is failing to retrieve the same document:
{"bool":{"must":[{"wildcard":{"ProductName":{"value":"dove"}}},{"wildcard":{"ProductName":{"value":"3.75 oz"}}}]}}
Question: How to design a query using a wildcard query that supports space or no spaces? Please share an example.
Text fields values are broken into tokens by default and then stored. So something like "hello man"" will be saved separately as hello and man because of the space between them. And that is exactly why this will not work with a wildcard query.
{"wildcard":{"ProductName":{"value":"3.75 oz"}}}
It only works for single tokens. For wildcard queries you can use a special field type called wildcard.
If you do not want to reindex your data, try phrase search like:
"match_phrase": {
"ProductName": {
"query": "3.75 oz"
}
}

How to compare two text fields in elastic search

I have a need to compare two text fields in elastic search, but they are text fields.
For normal fields I can use script to compare using doc['field'].value, but is there a way to do the same for text fields.
See below excerpt from ES DOCS :
By far the fastest most efficient way to access a field value from a script is to use the doc['field_name'] syntax, which retrieves the field value from doc values. Doc values are a columnar field value store, enabled by default on all fields except for analyzed text fields.
There are 2 ways known to me to access a text value from script.
Map a keyword representation of a text field as well and access that field.
{
"mappings": {
"properties": {
"name":{
"type": "text",
"fields": {
"keyword":{ // <======= See this
"type":"keyword"
}
}
}
}
}
}
The keyword representation can be accessed like 'doc[name.keyword].value'
It is recommended to index/store keyword representation of fields for small-size text fields like 'name', 'emailId' but is not recommended for larger fields like 'description', due to memory overhead
Another way is to enable field data on the text field and access that field.
Fielddata is disabled on text fields by default. Set fielddata=true on
[your_field_name] in order to load fielddata in memory by uninverting the
inverted index. Note that this can however use significant memory.
It is not recommended to use the field-data over the text fields however.
Note: Please do add details on why you need comparison and what kind of comparison is required on 'text' fields.

analyzed vs not_analyzed: storage size

I recently started using ElasticSearch 2. And as I undestand analyzed vs not_analyzed in the mapping, not_analyzed should be better in storage (https://www.elastic.co/blog/elasticsearch-storage-the-true-story-2.0 and https://www.elastic.co/blog/elasticsearch-storage-the-true-story).
For testing purposes I created some indexes with all the String field as analyzed (by default) and then I created some other indexes with all the fields as not_analyzed, my surprise came when I checked the size of the indexes and I saw that the indexes with the not_analyzed Strings were 40% bigger!! I was inserting the same documents in each index (35000 docs).
Any idea why this is happening? My documents are simple JSON documents. I have 60 String fields in each document that I want to set as not_analyzed and I tried both setting each field as not analyzed and also creating a dynamic template.
I edit for adding the mapping, although I think it has nothing special:
{
"mappings": {
"my_type" : {
"_ttl" : { "enabled" : true, "default" : "7d" },
"properties" : {
"field1" : {
"properties" : {
"field2" : {
"type" : "string", "index" : "not_analyzed"
}
more not_analyzed String fields here
...
...
...
}
not_analyzed fields are still indexed. They just don't have any transformations applied to them beforehand ("analysis" - in Lucene parlance).
As an example:
(Doc 1) "The quick brown fox jumped over the lazy dog"
(Doc 2) "Lazy like the fox"
Simplified postings list created by Standard Analyzer (default for analyzed string fields - tokenized, lowercased, stopwords removed):
"brown": [1]
"dog": [1]
"fox": [1,2]
"jumped": [1]
"lazy": [1,2]
"over": [1]
"quick": [1]
30 characters worth of string data
Simplified postings list created by "index": "not_analyzed":
"The quick brown fox jumped over the lazy dog": [1]
"Lazy like the fox": [2]
62 characters worth of string data
Analysis causes input to get tokenized and normalized for the purpose of being able to look up documents using a term.
But as a result, the unit of text is reduced to a normalized term (vs an entire field with not_analyzed), and all the redundant (normalized) terms across all documents are collapsed into a single logical list saving you all the space that would normally be consumed by repeated terms and stopwords.
From the documentation, it looks like not_analyzed makes the field act like a "keyword" instead of a "full-text" field -- let's compare these two!
Full text
These fields are analyzed, that is they are passed through an analyzer to convert the string into a list of individual terms before being indexed.
Keyword
Keyword fields are not_analyzed. Instead, the exact string value is added to the index as a single term.
I'm not surprised that storing an entire string as a term, rather than breaking it into a list of terms, doesn't necessarily translate to saved space. Honestly, it probably depends on the index's analyzer and the string being indexed.
As a side note, I just re-indexed about a million documents of production data and cut our index disk space usage by ~95%. The main difference I made was modifying what was actually saved in the source (AKA stored). We indexed PDFs for searching, but did not need them to be returned and so that saved us from saving this information in two different ways (analyzed and raw). There are some very real downsides to this, though, so be careful!
Doc1{
"name":"my name is mayank kumar"
}
Doc2.{
"name":"mayank"
}
Doc3.{
"name":"Mayank"
}
We have 3 documents.
So if field 'name' is 'not_analyzed' and we search for 'mayank' only second document would be returned.If we search for 'Mayank' only third document would be returned.
If field 'name' is 'analyzed' by a analyser 'lowercase analyser'(just as a example).We we search for 'mayank', all 3 documents would be returned.
If we search for 'kumar' ,first document would be returned.This happens because in first document the field value gets tokenised as "my" "name" "is" "mayank" "kumar"
'not_analyzed' is basically used for 'full-text' search(mostly except in wildcards matching).less space on disk.Takes less time during indexing.
'analyzed' is basically used for matching documents.more space on disk (if the analyze fields are big).Takes more time during indexing.(More fields due to analyze fields)

How to search fields with '-' characters in elastic search

I am new to elastic search. I have got following document where one of the field "eventId" has "-" in value.
When i try to search with complete value of eventId, i don't get any results.
Sample Document app/event
{
"tags": {}
"eventId": "cc98d57b-c6bc-424c-b54c-df1e3df0d942",
}
I haven't created any explicit settings for my index.
Thanks.
you should check if the tokenizer splits your value into multiple fields. Maybe your value is stored as 5 fields: "cc98d57b", "c6bc", "424c", "b54c" and "df1e3df0d942"
You can analyze that with the 'Kopf' Plugin (https://github.com/lmenezes/elasticsearch-kopf).
If that is your problem you should change your field mapping, so that the value is not analyzed ("index" : "not_analyzed").
For an example how to set that mapping see here: Elasticsearch mapping settings 'not_analyzed' and grouping by field in Java
After that, you should be able to search for your specific value.

Resources