I'm trying to write a query that will give me all the documents where the field "id" is of the form "SOMETHING-SOMETHING-4SOMETHING-SOMETHING-SOMETHING".
For instance, ab-ba-4a-b-a is a valid id.
I wrote this query:
"query": {
  "regexp": {
    "id": {
      "value": ".*-.*-4.*-.*-.*"
    }
  }
}
It gets no hits. What's wrong with this? I can see many ids of this form.
If the id field is of type keyword, the regexp query should work fine. However, if it is of type text, notice how Elasticsearch stores the tokens internally:
POST /_analyze
{
  "text": "abc-abc-4bc-abc-abc",
  "analyzer": "standard"
}
Response:
{
  "tokens" : [
    {
      "token" : "abc",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "abc",
      "start_offset" : 4,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "4bc",
      "start_offset" : 8,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "abc",
      "start_offset" : 12,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "abc",
      "start_offset" : 16,
      "end_offset" : 19,
      "type" : "<ALPHANUM>",
      "position" : 4
    }
  ]
}
Notice that it breaks the text abc-abc-4bc-abc-abc down into 5 tokens. Take a look at what Analysis and Analyzers are, and note that they are applied only to text fields.
The keyword datatype, on the other hand, was created precisely for the cases where you do not want your text to be analyzed (i.e. broken into tokens and stored in an inverted index); it stores the string value as-is internally.
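You can see the difference with the keyword analyzer, which keeps the whole string as a single token, mirroring what a keyword field stores (a quick check, not part of the original question):
POST /_analyze
{
  "text": "abc-abc-4bc-abc-abc",
  "analyzer": "keyword"
}
This returns the single token abc-abc-4bc-abc-abc, which your regexp pattern can match in full.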
Now, if your mapping is dynamic, ES by default creates two different fields for string values: a text field and its keyword sub-field, something like below:
{
  "mappings" : {
    "properties" : {
      "id" : {
        "type" : "text",
        "fields" : {
          "keyword" : {
            "type" : "keyword",
            "ignore_above" : 256
          }
        }
      }
    }
  }
}
In that case, just apply the query you have to the id.keyword field (note that a regexp query has to match the entire keyword value, which your pattern does):
POST <your_index_name>/_search
{
  "query": {
    "regexp": {
      "id.keyword": ".*-.*-4.*-.*-.*"
    }
  }
}
Hope that helps!
Related
It seems like there is a character minimum needed to get results from Elasticsearch for a specific property I am searching. It is called 'guid' and has the following configuration:
"guid": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
I have a document with the following GUID: 3e49996c-1dd8-4230-8f6f-abe4236a6fc4
The following query returns the document as expected:
{"match":{"query":"9996c-1dd8*","fields":["guid"]}}
However this query does not:
{"match":{"query":"9996c-1dd*","fields":["guid"]}}
I have the same result with multi_match and query_string queries. I haven't been able to find anything in the documentation about a character minimum, so what is happening here?
Elasticsearch does not require a minimum number of characters. What matters is the generated token. An exercise that helps in understanding this is to use the _analyze API to see your index's tokens.
GET index_001/_analyze
{
  "field": "guid",
  "text": [
    "3e49996c-1dd8-4230-8f6f-abe4236a6fc4"
  ]
}
Here you pass in the term 3e49996c-1dd8-4230-8f6f-abe4236a6fc4. Look at the resulting tokens:
"tokens" : [
{
"token" : "3e49996c",
"start_offset" : 0,
"end_offset" : 8,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "1dd8",
"start_offset" : 9,
"end_offset" : 13,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "4230",
"start_offset" : 14,
"end_offset" : 18,
"type" : "<NUM>",
"position" : 2
},
{
"token" : "8f6f",
"start_offset" : 19,
"end_offset" : 23,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "abe4236a6fc4",
"start_offset" : 24,
"end_offset" : 36,
"type" : "<ALPHANUM>",
"position" : 4
}
]
When you perform a search, the same analyzer that was used at indexing time is applied to the search terms. So when you search for the term "9996c-1dd8*":
GET index_001/_analyze
{
  "field": "guid",
  "text": [
    "9996c-1dd8*"
  ]
}
The generated tokens are:
{
  "tokens" : [
    {
      "token" : "9996c",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "1dd8",
      "start_offset" : 6,
      "end_offset" : 10,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}
Note that the inverted index contains the token 1dd8, and the term "9996c-1dd8*" also generated the token 1dd8, so the match took place.
When you test with the term "9996c-1dd*", no tokens match, so there are no results.
GET index_001/_analyze
{
  "field": "guid",
  "text": [
    "9996c-1dd*"
  ]
}
Tokens:
{
  "tokens" : [
    {
      "token" : "9996c",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "1dd",
      "start_offset" : 6,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}
Token "1dd" is not equal to "1dd8".
I have a field named log.file.path in Elasticsearch and it has /var/log/dev-collateral/uaa.2020-09-26.log value, I tried to retrieve all logs that log.file.path field starts with /var/log/dev-collateral/uaa
I used the below regexp but it doesn't work.
{
  "regexp": {
    "log.file.path": "/var/log/dev-collateral/uaa.*"
  }
}
Let's see why it is not working. I've indexed two documents using the Kibana UI, like below:
PUT myindex/_doc/1
{
  "log.file.path" : "/var/log/dev-collateral/uaa.2020-09-26.log"
}
PUT myindex/_doc/2
{
  "log.file.path" : "/var/log/dev-collateral/uaa.2020-09-26.txt"
}
When I try to see the tokens of the text in the log.file.path field using the _analyze API:
POST _analyze
{
  "text": "/var/log/dev-collateral/uaa.2020-09-26.log"
}
It gives me:
{
  "tokens" : [
    {
      "token" : "var",
      "start_offset" : 1,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "log",
      "start_offset" : 5,
      "end_offset" : 8,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "dev",
      "start_offset" : 9,
      "end_offset" : 12,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "collateral",
      "start_offset" : 13,
      "end_offset" : 23,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "uaa",
      "start_offset" : 24,
      "end_offset" : 27,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "2020",
      "start_offset" : 28,
      "end_offset" : 32,
      "type" : "<NUM>",
      "position" : 5
    },
    {
      "token" : "09",
      "start_offset" : 33,
      "end_offset" : 35,
      "type" : "<NUM>",
      "position" : 6
    },
    {
      "token" : "26",
      "start_offset" : 36,
      "end_offset" : 38,
      "type" : "<NUM>",
      "position" : 7
    },
    {
      "token" : "log",
      "start_offset" : 39,
      "end_offset" : 42,
      "type" : "<ALPHANUM>",
      "position" : 8
    }
  ]
}
You can see that Elasticsearch split your input text into tokens when you indexed it. This is because Elasticsearch applies the standard analyzer when we index documents: it splits the document into small tokens, removes punctuation, lowercases the text, etc. That's why your current regexp query doesn't work.
GET myindex/_search
{
  "query": {
    "match": {
      "log.file.path": "var"
    }
  }
}
If you query it this way it will work, but in your case you need to match every log.file.path that starts with /var/log/dev-collateral/uaa. So what to do now? Just don't analyze the field when indexing documents; the keyword type stores the string you provide as-is.
Create a mapping with the keyword type:
PUT myindex2/
{
  "mappings": {
    "properties": {
      "log.file.path": {
        "type": "keyword"
      }
    }
  }
}
Index the documents:
PUT myindex2/_doc/1
{
  "log.file.path" : "/var/log/dev-collateral/uaa.2020-09-26.log"
}
PUT myindex2/_doc/2
{
  "log.file.path" : "/var/log/dev-collateral/uaa.2020-09-26.txt"
}
Search with regexp:
GET myindex2/_search
{
  "query": {
    "regexp": {
      "log.file.path": "/var/log/dev-collateral/uaa.2020-09-26.*"
    }
  }
}
I used this query and it works! It targets log.file.path.keyword, the keyword sub-field that dynamic mapping creates by default, so it matches against the unanalyzed path.
{
  "query": {
    "regexp": {
      "log.file.path.keyword": {
        "value": "/var/log/dev-collateral/uaa.*",
        "flags": "ALL",
        "max_determinized_states": 10000,
        "rewrite": "constant_score"
      }
    }
  }
}
Related
How can I map a word to another word in Elasticsearch? That is, suppose I have the following document:
{
  "carName": "Porche",
  "review": " this car is so awesome"
}
Now when I search for good/fantastic etc., it should map to "awesome".
Is there any way I can do this in Elasticsearch?
Yes, you can achieve this by using a synonym token filter.
First you need to define a new custom analyzer in your index and use that analyzer in your mapping.
curl -XPUT localhost:9200/cars -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "synonyms"
          ]
        }
      },
      "filter": {
        "synonyms": {
          "type": "synonym",
          "synonyms": [
            "good, awesome, fantastic"
          ]
        }
      }
    }
  },
  "mappings": {
    "car": {
      "properties": {
        "carName": {
          "type": "string"
        },
        "review": {
          "type": "string",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}'
You can add as many synonyms as you want, either in the settings directly or in a separate file that you can reference in the settings using the synonyms_path property.
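For example, the filter definition could reference a file instead of listing the synonyms inline (a sketch; the file name is illustrative, and the path is resolved relative to the Elasticsearch config directory):
"filter": {
  "synonyms": {
    "type": "synonym",
    "synonyms_path": "analysis/synonyms.txt"
  }
}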
Then we can index your sample document above:
curl -XPUT localhost:9200/cars/car/1 -d '{
  "carName": "Porche",
  "review": " this car is so awesome"
}'
What is going to happen is that when the synonyms token filter kicks in, it will also index the tokens good and fantastic along with awesome so that you can search and find that document by those tokens as well. Concretely, analyzing the sentence this car is so awesome...
curl -XGET 'localhost:9200/cars/_analyze?analyzer=my_analyzer&pretty' -d 'this car is so awesome'
...will produce the following tokens (see the last three tokens):
{
  "tokens" : [ {
    "token" : "this",
    "start_offset" : 0,
    "end_offset" : 4,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "car",
    "start_offset" : 5,
    "end_offset" : 8,
    "type" : "<ALPHANUM>",
    "position" : 2
  }, {
    "token" : "is",
    "start_offset" : 9,
    "end_offset" : 11,
    "type" : "<ALPHANUM>",
    "position" : 3
  }, {
    "token" : "so",
    "start_offset" : 12,
    "end_offset" : 14,
    "type" : "<ALPHANUM>",
    "position" : 4
  }, {
    "token" : "good",
    "start_offset" : 15,
    "end_offset" : 22,
    "type" : "SYNONYM",
    "position" : 5
  }, {
    "token" : "awesome",
    "start_offset" : 15,
    "end_offset" : 22,
    "type" : "SYNONYM",
    "position" : 5
  }, {
    "token" : "fantastic",
    "start_offset" : 15,
    "end_offset" : 22,
    "type" : "SYNONYM",
    "position" : 5
  } ]
}
Finally, you can search like this and the document will be retrieved:
curl -XGET localhost:9200/cars/car/_search?q=review:good
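Or, equivalently, with a request body (the same search expressed as a standard match query):
curl -XGET localhost:9200/cars/car/_search -d '{
  "query": {
    "match": {
      "review": "good"
    }
  }
}'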
Related
Using Elasticsearch for searching our documents, we discovered that when we search for "wave board" we get no good results, because documents containing "waveboard" are not at the top of the results. Google does this kind of "term combining". Is there a simple way to do this in ES?
Found a good solution: create a custom analyzer with a shingle filter using "" as the token separator, and use that analyzer in a query (use a bool query to combine it with your standard queries).
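A minimal sketch of such an analyzer (the index, analyzer, and filter names here are illustrative):
curl -XPUT localhost:9200/shingle_index -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "shingle_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "my_shingle_filter"]
        }
      },
      "filter": {
        "my_shingle_filter": {
          "type": "shingle",
          "token_separator": ""
        }
      }
    }
  }
}'
With this analyzer, the text "wave board" yields the tokens wave, board, and waveboard, so a query analyzed the same way can match documents containing the concatenated form.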
To do this at analysis time, you can also use what is known as a "decompounding" token filter. Here is an example that decompounds the text "catdogmouse" into the tokens "cat", "dog", and "mouse":
POST /decom
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "decom_analyzer": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["decom_filter"]
          }
        },
        "filter": {
          "decom_filter": {
            "type": "dictionary_decompounder",
            "word_list": ["cat", "dog", "mouse"]
          }
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "body": {
          "type": "string",
          "analyzer": "decom_analyzer"
        }
      }
    }
  }
}
And then you can see how they are applied to certain terms:
POST /decom/_analyze?field=body&pretty
racecatthings
{
  "tokens" : [ {
    "token" : "racecatthings",
    "start_offset" : 1,
    "end_offset" : 14,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "cat",
    "start_offset" : 1,
    "end_offset" : 14,
    "type" : "<ALPHANUM>",
    "position" : 1
  } ]
}
And another (you should be able to extrapolate from this to separate "waveboard" into "wave" and "board"):
POST /decom/_analyze?field=body&pretty
catdogmouse
{
  "tokens" : [ {
    "token" : "catdogmouse",
    "start_offset" : 1,
    "end_offset" : 12,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "cat",
    "start_offset" : 1,
    "end_offset" : 12,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "dog",
    "start_offset" : 1,
    "end_offset" : 12,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "mouse",
    "start_offset" : 1,
    "end_offset" : 12,
    "type" : "<ALPHANUM>",
    "position" : 1
  } ]
}
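For the original question, the same filter with "wave" and "board" in its word list should by extension decompound "waveboard" into "wave" and "board" (a sketch; not verified against the asker's data):
"filter": {
  "decom_filter": {
    "type": "dictionary_decompounder",
    "word_list": ["wave", "board"]
  }
}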
Related
I'd like to concatenate words and then ngram the result.
What's the correct setting for Elasticsearch?
In English:
from: stack overflow
==> stackoverflow : concatenate first,
==> sta / tac / ack / cko / kov / ... etc. (min_gram: 3, max_gram: 10)
To do the concatenation I'm assuming that you just want to remove all spaces from your input data. To do this, you need to implement a pattern_replace char filter that replaces space with nothing.
Setting up the ngram tokenizer should be easy - just specify your token min/max lengths.
It's worth adding a lowercase token filter too - to make searching case insensitive.
curl -XPOST localhost:9200/my_index -d '{
  "index": {
    "analysis": {
      "analyzer": {
        "my_new_analyzer": {
          "filter": [
            "lowercase"
          ],
          "tokenizer": "my_ngram_tokenizer",
          "char_filter": ["my_pattern"],
          "type": "custom"
        }
      },
      "char_filter": {
        "my_pattern": {
          "type": "pattern_replace",
          "pattern": "\u0020",
          "replacement": ""
        }
      },
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "nGram",
          "min_gram": "3",
          "max_gram": "10",
          "token_chars": ["letter", "digit", "punctuation", "symbol"]
        }
      }
    }
  }
}'
testing this:
curl -XGET 'localhost:9200/my_index/_analyze?analyzer=my_new_analyzer&pretty' -d 'stack overflow'
gives the following (just a small part shown below):
{
  "tokens" : [ {
    "token" : "sta",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "stac",
    "start_offset" : 0,
    "end_offset" : 4,
    "type" : "word",
    "position" : 2
  }, {
    "token" : "stack",
    "start_offset" : 0,
    "end_offset" : 6,
    "type" : "word",
    "position" : 3
  }, {
    "token" : "stacko",
    "start_offset" : 0,
    "end_offset" : 7,
    "type" : "word",
    "position" : 4
  }, {
    "token" : "stackov",
    "start_offset" : 0,
    "end_offset" : 8,
    "type" : "word",
    "position" : 5
  }, {