removing special characters and words from a url - elasticsearch

I was looking for a way to generate both words and special characters as tokens from a URL.
e.g. for the URL https://www.google.com/
I want to generate tokens in Elasticsearch such as https, www, google, com, :, /, /, ., ., /

You can define a custom analyzer with the letter tokenizer, as shown below:
PUT index3
{
"settings": {
"analysis": {
"analyzer": {
"my_email": {
"tokenizer": "letter",
"filter": [
"lowercase"
]
}
}
}
}
}
Test API:
POST index3/_analyze
{
"text": [
"https://www.google.com/"
],
"analyzer": "my_email"
}
Output:
{
"tokens" : [
{
"token" : "https",
"start_offset" : 0,
"end_offset" : 5,
"type" : "word",
"position" : 0
},
{
"token" : "www",
"start_offset" : 8,
"end_offset" : 11,
"type" : "word",
"position" : 1
},
{
"token" : "google",
"start_offset" : 12,
"end_offset" : 18,
"type" : "word",
"position" : 2
},
{
"token" : "com",
"start_offset" : 19,
"end_offset" : 22,
"type" : "word",
"position" : 3
}
]
}
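To index documents with this analyzer, you also need to attach it to a field in the mapping. A minimal sketch, assuming a fresh index and a hypothetical url field:
PUT index4
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_email": {
          "tokenizer": "letter",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "url": {
        "type": "text",
        "analyzer": "my_email"
      }
    }
  }
}
Text indexed into url is then tokenized by my_email, and the same analyzer is applied to match queries against that field.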

Related

Why do elasticsearch queries require a certain number of characters to return results?

It seems like there is a character minimum needed to get results with elasticsearch for a specific property I am searching. It is called 'guid' and has the following configuration:
"guid": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
I have a document with the following GUID: 3e49996c-1dd8-4230-8f6f-abe4236a6fc4
The following query returns the document as expected:
{"match":{"query":"9996c-1dd8*","fields":["guid"]}}
However, this query does not:
{"match":{"query":"9996c-1dd*","fields":["guid"]}}
I have the same result with multi_match and query_string queries. I haven't been able to find anything in the documentation about a character minimum, so what is happening here?
Elasticsearch does not require a minimum number of characters. What matters are the tokens that get generated.
A useful exercise is to call the _analyze API and look at the tokens your index produces.
GET index_001/_analyze
{
"field": "guid",
"text": [
"3e49996c-1dd8-4230-8f6f-abe4236a6fc4"
]
}
For the term 3e49996c-1dd8-4230-8f6f-abe4236a6fc4, the generated tokens look like this:
"tokens" : [
{
"token" : "3e49996c",
"start_offset" : 0,
"end_offset" : 8,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "1dd8",
"start_offset" : 9,
"end_offset" : 13,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "4230",
"start_offset" : 14,
"end_offset" : 18,
"type" : "<NUM>",
"position" : 2
},
{
"token" : "8f6f",
"start_offset" : 19,
"end_offset" : 23,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "abe4236a6fc4",
"start_offset" : 24,
"end_offset" : 36,
"type" : "<ALPHANUM>",
"position" : 4
}
]
At search time, the same analyzer that was used at index time is applied to the query text.
So when you search for the term "9996c-1dd8*":
GET index_001/_analyze
{
"field": "guid",
"text": [
"9996c-1dd8*"
]
}
The generated tokens are:
{
"tokens" : [
{
"token" : "9996c",
"start_offset" : 0,
"end_offset" : 5,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "1dd8",
"start_offset" : 6,
"end_offset" : 10,
"type" : "<ALPHANUM>",
"position" : 1
}
]
}
Note that the inverted index contains the token 1dd8, and the term "9996c-1dd8*" also generates the token 1dd8, so the match takes place.
When you test with the term "9996c-1dd*", no tokens match, so there are no results.
GET index_001/_analyze
{
"field": "guid",
"text": [
"9996c-1dd*"
]
}
Tokens:
{
"tokens" : [
{
"token" : "9996c",
"start_offset" : 0,
"end_offset" : 5,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "1dd",
"start_offset" : 6,
"end_offset" : 9,
"type" : "<ALPHANUM>",
"position" : 1
}
]
}
Token "1dd" is not equal to "1dd8".

Query in Kibana doesn't return logs with Regexp

I have a field named log.file.path in Elasticsearch with the value /var/log/dev-collateral/uaa.2020-09-26.log, and I am trying to retrieve all logs whose log.file.path field starts with /var/log/dev-collateral/uaa.
I used the regexp below, but it doesn't work:
{
"regexp":{
"log.file.path": "/var/log/dev-collateral/uaa.*"
}
}
Let's see why it is not working. I've indexed two documents using the Kibana UI as shown below:
PUT myindex/_doc/1
{
"log.file.path" : "/var/log/dev-collateral/uaa.2020-09-26.log"
}
PUT myindex/_doc/2
{
"log.file.path" : "/var/log/dev-collateral/uaa.2020-09-26.txt"
}
Let's look at the tokens that the _analyze API generates for the text of the log.file.path field:
POST _analyze
{
"text": "/var/log/dev-collateral/uaa.2020-09-26.log"
}
It gives me,
{
"tokens" : [
{
"token" : "var",
"start_offset" : 1,
"end_offset" : 4,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "log",
"start_offset" : 5,
"end_offset" : 8,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "dev",
"start_offset" : 9,
"end_offset" : 12,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "collateral",
"start_offset" : 13,
"end_offset" : 23,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "uaa",
"start_offset" : 24,
"end_offset" : 27,
"type" : "<ALPHANUM>",
"position" : 4
},
{
"token" : "2020",
"start_offset" : 28,
"end_offset" : 32,
"type" : "<NUM>",
"position" : 5
},
{
"token" : "09",
"start_offset" : 33,
"end_offset" : 35,
"type" : "<NUM>",
"position" : 6
},
{
"token" : "26",
"start_offset" : 36,
"end_offset" : 38,
"type" : "<NUM>",
"position" : 7
},
{
"token" : "log",
"start_offset" : 39,
"end_offset" : 42,
"type" : "<ALPHANUM>",
"position" : 8
}
]
}
You can see that Elasticsearch split your input text into tokens when you indexed the documents. This is because Elasticsearch applies the standard analyzer to text fields by default, which splits the text into small tokens, removes punctuation, lowercases the text, etc. That's why your current regexp query doesn't work.
GET myindex/_search
{
"query": {
"match": {
"log.file.path": "var"
}
}
}
A query like this will work, but in your case you need to match every log.file.path that starts with /var/log/dev-collateral/uaa. So what to do now? Don't analyze the field while indexing: the keyword type stores the string you provide as it is.
Create a mapping with the keyword type,
PUT myindex2/
{
"mappings": {
"properties": {
"log.file.path": {
"type": "keyword"
}
}
}
}
Index documents,
PUT myindex2/_doc/1
{
"log.file.path" : "/var/log/dev-collateral/uaa.2020-09-26.log"
}
PUT myindex2/_doc/2
{
"log.file.path" : "/var/log/dev-collateral/uaa.2020-09-26.txt"
}
Search with regexp,
GET myindex2/_search
{
"query": {
"regexp": {
"log.file.path": "/var/log/dev-collateral/uaa.2020-09-26.*"
}
}
}
I used this query against the log.file.path.keyword sub-field and it works, because the keyword sub-field created by the default dynamic mapping stores the unanalyzed string:
{
"query": {
"regexp": {
"log.file.path.keyword": {
"value": "/var/log/dev-collateral/uaa.*",
"flags": "ALL",
"max_determinized_states": 10000,
"rewrite": "constant_score"
}
}
}
}
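Since the requirement is really "starts with", a prefix query against the unanalyzed keyword sub-field is another option. A sketch, assuming the default dynamic mapping that creates log.file.path.keyword:
GET myindex/_search
{
  "query": {
    "prefix": {
      "log.file.path.keyword": {
        "value": "/var/log/dev-collateral/uaa"
      }
    }
  }
}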

using word_delimiter with edgeNGram ignores Word_Delimiter Token

I have my custom analyzer as below, but I don't understand how to achieve my goal.
My goal is to have a whitespace-separated inverted index, but also to have an autocomplete feature once the user enters at least 3 characters. For that I thought of combining the word_delimiter and edgeNGram token filters as below:
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "whitespace",
"filter": [
"standard",
"lowercase",
"my_word_delimiter",
"my_edge_ngram_analyzer"
],
"type": "custom"
}
},
"filter": {
"my_word_delimiter": {
"catenate_all": true,
"type": "word_delimiter"
},
"my_edge_ngram_analyzer": {
"min_gram": 3,
"max_gram": 10,
"type": "edgeNGram"
}
}
}
}
}
}
This gives the result below for "Brother TN-200". But I was expecting "tn" to also be in the inverted index, since I have the word_delimiter token filter. Why is it not in the inverted index? How can I achieve this?
curl -XGET "localhost:9200/myIndex/_analyze?analyzer=my_analyzer&pretty=true" -d "Brother TN-200"
{
"tokens" : [ {
"token" : "bro",
"start_offset" : 14,
"end_offset" : 21,
"type" : "word",
"position" : 2
}, {
"token" : "brot",
"start_offset" : 14,
"end_offset" : 21,
"type" : "word",
"position" : 2
}, {
"token" : "broth",
"start_offset" : 14,
"end_offset" : 21,
"type" : "word",
"position" : 2
}, {
"token" : "brothe",
"start_offset" : 14,
"end_offset" : 21,
"type" : "word",
"position" : 2
}, {
"token" : "brother",
"start_offset" : 14,
"end_offset" : 21,
"type" : "word",
"position" : 2
}, {
"token" : "tn2",
"start_offset" : 22,
"end_offset" : 28,
"type" : "word",
"position" : 3
}, {
"token" : "tn20",
"start_offset" : 22,
"end_offset" : 28,
"type" : "word",
"position" : 3
}, {
"token" : "tn200",
"start_offset" : 22,
"end_offset" : 28,
"type" : "word",
"position" : 3
}, {
"token" : "200",
"start_offset" : 25,
"end_offset" : 28,
"type" : "word",
"position" : 4
}]
}
UPDATE:
Of course, if I use "min_gram": 2, "tn" will be in the inverted index, but I don't want this, because if any other word contains "tn" inside it, that word will also appear in the result list.
For example, take the "hp" keyword: I get products for "Hewlett Packard", since my products are named like "hp xxx", but I also get a product called "tech hpc". I don't want that product to be displayed until I type "hpc". That's the reason I set min_gram to 3.
If I don't use the edgeNGram filter but only word_delimiter, "tn" is in the inverted index, as "Brother TN-200" will be indexed as brother, tn and 200. That's why I expected word_delimiter to put "tn" into the inverted index. Does it have no effect when I use it together with edgeNGram?
In my_edge_ngram_analyzer the min_gram setting is 3, so any token shorter than 3 characters does not show up; the edge n-gram filter is applied to every token that reaches it, including the ones produced by word_delimiter, which is why "tn" disappears.
You would need to set min_gram to 2 if you want "tn" to show up.
Example:
get <my_index>/_analyze?tokenizer=whitespace&filters=my_edge_ngram_analyzer&text=TN
The above call would return 0 tokens.
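One way to get both behaviours is to keep the plain word_delimiter tokens in the main field and put the edge n-grams in a sub-field that is only used for autocomplete. A sketch using current Elasticsearch syntax (edge_ngram instead of the older edgeNGram); the index, field and analyzer names are assumptions:
PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_word_delimiter": {
          "type": "word_delimiter",
          "catenate_all": true
        },
        "my_edge_ngram": {
          "type": "edge_ngram",
          "min_gram": 3,
          "max_gram": 10
        }
      },
      "analyzer": {
        "delimiter_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["lowercase", "my_word_delimiter"]
        },
        "autocomplete_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["lowercase", "my_word_delimiter", "my_edge_ngram"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "delimiter_analyzer",
        "fields": {
          "autocomplete": {
            "type": "text",
            "analyzer": "autocomplete_analyzer"
          }
        }
      }
    }
  }
}
With this layout, "Brother TN-200" indexes brother, tn, 200 and tn200 into name, while name.autocomplete holds the 3-to-10 character edge n-grams, so a multi_match query over both fields covers exact short terms as well as prefix suggestions.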

Combine search terms automatically with Elasticsearch?

Using Elasticsearch to search our documents, we discovered that when we search for "wave board" we get no good results, because documents containing "waveboard" are not at the top of the results. Google does this kind of "term combining". Is there a simple way to do this in ES?
Found a good solution: create a custom analyzer with a shingle filter using "" as the token separator and use that in a query (use a bool query to combine it with standard queries); see the sketch below.
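A sketch of such an analyzer, with hypothetical index and analyzer names; setting the shingle filter's token_separator to "" glues adjacent tokens together:
PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "concat_shingles": {
          "type": "shingle",
          "max_shingle_size": 2,
          "token_separator": "",
          "output_unigrams": true
        }
      },
      "analyzer": {
        "shingle_concat": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "concat_shingles"]
        }
      }
    }
  }
}
Analyzing "wave board" with this analyzer yields wave, board and waveboard, so a query run through it can also hit documents that contain the compound form.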
To do this at analysis time, you can also use what is known as a "decompounding" token filter. Here is an example that decompounds the text "catdogmouse" into the tokens "cat", "dog", and "mouse":
POST /decom
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"decom_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": ["decom_filter"]
}
},
"filter": {
"decom_filter": {
"type": "dictionary_decompounder",
"word_list": ["cat", "dog", "mouse"]
}
}
}
}
},
"mappings": {
"doc": {
"properties": {
"body": {
"type": "string",
"analyzer": "decom_analyzer"
}
}
}
}
}
And then you can see how they are applied to certain terms:
POST /decom/_analyze?field=body&pretty
racecatthings
{
"tokens" : [ {
"token" : "racecatthings",
"start_offset" : 1,
"end_offset" : 14,
"type" : "<ALPHANUM>",
"position" : 1
}, {
"token" : "cat",
"start_offset" : 1,
"end_offset" : 14,
"type" : "<ALPHANUM>",
"position" : 1
} ]
}
And another (you should be able to extrapolate this to separate "waveboard" into "wave" and "board"; a concrete sketch for that case follows the output below):
POST /decom/_analyze?field=body&pretty
catdogmouse
{
"tokens" : [ {
"token" : "catdogmouse",
"start_offset" : 1,
"end_offset" : 12,
"type" : "<ALPHANUM>",
"position" : 1
}, {
"token" : "cat",
"start_offset" : 1,
"end_offset" : 12,
"type" : "<ALPHANUM>",
"position" : 1
}, {
"token" : "dog",
"start_offset" : 1,
"end_offset" : 12,
"type" : "<ALPHANUM>",
"position" : 1
}, {
"token" : "mouse",
"start_offset" : 1,
"end_offset" : 12,
"type" : "<ALPHANUM>",
"position" : 1
} ]
}
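For the original question, the same pattern with "wave" and "board" in the word list should do the trick. A sketch (the index name decom2 is an assumption):
PUT /decom2
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "decom_analyzer": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["decom_filter"]
          }
        },
        "filter": {
          "decom_filter": {
            "type": "dictionary_decompounder",
            "word_list": ["wave", "board"]
          }
        }
      }
    }
  }
}
Running "waveboard" through this analyzer should produce the tokens waveboard, wave and board, so a search for "wave board" can match documents containing the compound word.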

Elasticsearch, how to concatenate words then ngram it?

I'd like to concatenate words and then ngram them.
What's the correct setting for Elasticsearch?
For example, in English:
from: stack overflow
==> stackoverflow : concatenate first,
==> sta / tac / ack / cko / kov / ... etc. (min_gram: 3, max_gram: 10)
To do the concatenation I'm assuming that you just want to remove all spaces from your input data. To do this, you need to implement a pattern_replace char filter that replaces space with nothing.
Setting up the ngram tokenizer should be easy - just specify your token min/max lengths.
It's worth adding a lowercase token filter too - to make searching case insensitive.
curl -XPOST localhost:9200/my_index -d '{
"index": {
"analysis": {
"analyzer": {
"my_new_analyzer": {
"filter": [
"lowercase"
],
"tokenizer": "my_ngram_tokenizer",
"char_filter" : ["my_pattern"],
"type": "custom"
}
},
"char_filter" : {
"my_pattern":{
"type":"pattern_replace",
"pattern":"\u0020",
"replacement":""
}
},
"tokenizer" : {
"my_ngram_tokenizer" : {
"type" : "nGram",
"min_gram" : "3",
"max_gram" : "10",
"token_chars": ["letter", "digit", "punctuation", "symbol"]
}
}
}
}
}'
testing this:
curl -XGET 'localhost:9200/my_index/_analyze?analyzer=my_new_analyzer&pretty' -d 'stack overflow'
gives the following (just a small part shown below):
{
"tokens" : [ {
"token" : "sta",
"start_offset" : 0,
"end_offset" : 3,
"type" : "word",
"position" : 1
}, {
"token" : "stac",
"start_offset" : 0,
"end_offset" : 4,
"type" : "word",
"position" : 2
}, {
"token" : "stack",
"start_offset" : 0,
"end_offset" : 6,
"type" : "word",
"position" : 3
}, {
"token" : "stacko",
"start_offset" : 0,
"end_offset" : 7,
"type" : "word",
"position" : 4
}, {
"token" : "stackov",
"start_offset" : 0,
"end_offset" : 8,
"type" : "word",
"position" : 5
}, ...
]
}
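To use this analyzer for indexed documents rather than just the _analyze API, it also needs to be mapped onto a field. A sketch in the same older Elasticsearch syntax as above; the type and field names are assumptions:
curl -XPUT 'localhost:9200/my_index/_mapping/my_type' -d '{
  "my_type": {
    "properties": {
      "title": {
        "type": "string",
        "analyzer": "my_new_analyzer"
      }
    }
  }
}'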
