How to filter multiple characters in Elasticsearch

Is there a way to filter multiple characters during analysis in Elasticsearch? We would like to set it up so that if a user searches for 'botled', they get documents that include 'bottled', 'botttled', etc., i.e. regardless of doubled or tripled letters.
I have been looking for a solution among the token filters https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenfilters.html, but it seems that none of them matches our requirements.

By default, an Elasticsearch text field is tokenized into words (essentially split on whitespace and punctuation), i.e. only whole words are indexed and searchable.
Would a regexp search work for you?
GET /_search
{
  "query": {
    "regexp": {
      "user": {
        "value": "b+o+t+t+l+e+d+"
      }
    }
  }
}
b+ --> one or more occurrences of b
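Another way to approach the original requirement, not covered in the answer above, is to collapse repeated letters at analysis time with a pattern_replace character filter, so that 'botled', 'bottled' and 'botttled' all produce the same indexed term. A minimal sketch (the index, field, analyzer and filter names here are placeholders):
PUT dedup_test
{
  "settings": {
    "analysis": {
      "char_filter": {
        "squash_repeats": {
          "type": "pattern_replace",
          "pattern": "(.)\\1+",
          "replacement": "$1"
        }
      },
      "analyzer": {
        "squashed": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter": ["squash_repeats"],
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "squashed"
      }
    }
  }
}
Because the same analyzer runs at index and search time, a match query for 'botled' would also find 'bottled' and 'botttled'. The trade-off is that genuinely different words that differ only by a doubled letter (e.g. 'meet' vs 'met') get collapsed as well.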

Related

Elasticsearch match_phrase_prefix query, with exact prefix match

I have a match_phrase_prefix query, which works as expected. But when the user passes any special characters at the end of the keyword, ES ignores these characters and still returns results.
query{ match_phrase_prefix:{ content: { query: searchTerm } } }
I am using this query to search for prefixes. If I pass a term like overflow####!!, ES returns all the results containing the word overflow. Instead, I want an exact prefix match, where the special characters are not ignored. The search term could also consist of multiple words, e.g. stack overflow search.
How can I make ES do a prefix match without ignoring the special characters?
You can use the keyword analyzer when defining your query.
{
  "query": {
    "match_phrase_prefix": {
      "content": {
        "query": "overflow####!!",
        "analyzer": "keyword"
      }
    }
  }
}
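To see why this changes the behaviour, you can compare analyzers with the _analyze API: the keyword analyzer keeps the whole input, special characters included, as a single token, whereas the standard analyzer strips them. A quick check using only built-in analyzers:
GET _analyze
{
  "analyzer": "keyword",
  "text": "overflow####!!"
}
Running the same request with "analyzer": "standard" shows the trailing special characters being dropped.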

Is there a way to get ElasticSearch to create n-gram tokens from truncated field?

Documents contain a url field with a full URL. Users should be able to search for documents containing a given URL by supplying a portion of the URL string. The search string can be 3-15 characters long. An N-gram token filter with min_gram of 3 and max_gram of 15 would work, but generates a large number of tokens for long URLs. Is it possible to have Elasticsearch only generate tokens for the first 100 characters of the url field?
For example, the user should be able to search for documents containing the following URL using a search string such as 'example.com' or '/foo/bar'.
https://click.example.com/foo/bar/55gft/?qs=1952934d0ee8e2368ec7f7a921e3c6202b39365b9a2d26774c8122b8555ca21fce9d2344fc08a8ba40caede5e6901a112c6e89ead40892109eb8290d70571eab
There are two ways to achieve what you want.
Option 1: Keep using ngrams as you do now, but insert a truncate token filter before the ngram one, so the URL is limited to its first 100 characters before it gets ngrammed.
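A sketch of what Option 1 could look like (the index, analyzer and filter names are placeholders; note that index.max_ngram_diff must be raised to allow a 3-15 gram range):
PUT urls_truncated
{
  "settings": {
    "index.max_ngram_diff": 12,
    "analysis": {
      "filter": {
        "url_truncate": {
          "type": "truncate",
          "length": 100
        },
        "url_ngrams": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 15
        }
      },
      "analyzer": {
        "truncated_ngram": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["lowercase", "url_truncate", "url_ngrams"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "url": {
        "type": "text",
        "analyzer": "truncated_ngram"
      }
    }
  }
}
At search time you would typically pair this with a simpler search_analyzer so the query string itself is not ngrammed.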
Option 2: Use the wildcard field type, which has been created exactly for cases like this.
In your index, you should first change the type of the URL field to wildcard:
PUT test
{
  "mappings": {
    "properties": {
      "url": {
        "type": "wildcard"
      }
    }
  }
}
Then, you can search on that field, using the wildcard query, like this:
POST test/_search
{
  "query": {
    "wildcard": {
      "url": "*foo/bar*"
    }
  }
}
Also, read the related blog post, which shows in detail how the wildcard field type performs.

In Elasticsearch match query how to deal with slash

I have a match query searching for a type of doc:
{
  "query": {
    "bool": {
      "should": {
        "match": {
          "ph1_enc": "EAAQnb1kMr/e2/ADqo"
        }
      }
    }
  }
}
"EAAQnb1kMr/e2/ADqo" is the string i'm trying to match, however in the search results I can see multiple records with substring "/e2/" are also returned.
Looks like "/e2/" is indexed separately, so that this could happen.I thought the match query is to do full-text match... Is it because I missed something when creating the template? Any idea?
Add-on instead of reindex, how to modify the query to match the exact value in the query?
Which analyzer did you set in the mapping to index your data?
If you are using the default one (the standard analyzer), then according to the documentation it uses the standard tokenizer, which also splits the text on slashes ('/'). The documentation redirects here for more information about the tokenizer.
So it will index the words 'EAAQnb1kMr', 'e2', and 'ADqo'. Accordingly, your query value will be analyzed the same way the field was indexed, which is why documents containing 'e2' are also being returned.
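You can confirm this with the _analyze API, which returns the individual tokens (lowercased by the standard analyzer) rather than the whole string:
POST _analyze
{
  "analyzer": "standard",
  "text": "EAAQnb1kMr/e2/ADqo"
}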
If you don't need to tokenize the 'ph1_enc' field, you can just set its type in the mapping as 'keyword'.
"properties": {
"ph1_enc": {
"type": "keyword"
}
}
That will not analyze the field, and it will match the exact value when you query.
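With that keyword mapping in place (which does require reindexing), an exact-value lookup could use a term query; a sketch, with test as a placeholder index name:
GET test/_search
{
  "query": {
    "term": {
      "ph1_enc": "EAAQnb1kMr/e2/ADqo"
    }
  }
}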
I hope that helps.

Is the text provided to a wildcard query analyzed?

I plan to use wildcard queries on analyzed text fields that use asciifolding (to get rid of French accents) and lowercase.
My first tests show, e.g.,
matches for "wildcard": { "ar_titre.raw": { "value": "nomme*" } } but no matches for "wildcard": { "ar_titre.raw": { "value": "nommé*" } }
Does that mean that when using wildcard (or prefix) queries, the text provided to "value" is not analyzed? Or is that a bug?
Wildcard queries are term-level queries.
As explained in the official documentation, the wildcard expression is not analyzed:
Matches documents that have fields matching a wildcard expression (not analyzed).
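A quick way to see what is happening: run the field's analysis chain (assumed here to be roughly the standard tokenizer plus lowercase and asciifolding, as described in the question) through the _analyze API. The indexed term for "nommé" comes back as "nomme", which is why the unanalyzed wildcard value "nomme*" matches while "nommé*" does not:
GET _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase", "asciifolding"],
  "text": "nommé"
}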

How to deal with punctuation in an ElasticSearch field

I have a field in a document stored in Elasticsearch, which I want to be analyzed as a full text field. In one case, it contains a value for the name field like this:
A&B Corp
I want to be able to search the documents for an auto-complete widget, using a query like this (suppose the user typed A&B into the autocomplete field). The intention is to match documents that contain any terms with the typed prefix.
{ "query": {
"filtered": {
"query": {
"query_string": {
"query": "A&B*",
"fields": [
"firstName",
"lastName",
"name",
"key",
"email"
]
}
},
"filter": {
"terms": {
"environmentId": [
"foo"
]
}
}
}
}
}
My mapping for the name field looks like this:
"name": {
"type": "string"
},
But I get no results. The query structure works for documents that don't have & in the field, so I'm pretty sure that is part of the problem.
But I'm not sure how to deal with this. I am pretty sure I still want to analyze the field for full text search.
In addition, if I add a space before the * in the query (i.e., "query": "A&B *"), then I get results including A&B, so I don't think it is just discarding the ampersand and treating the A and B as separate terms.
Should I change my mapping? The query?
The query_string query has a set of reserved characters that need to be escaped.
query_string: read the reserved characters section.
So, to search for 'A&B' (or) 'A&B Corp' (or) 'A&B....', your query must be "A&B\\*" so that the query_string parser treats it as a * wildcard operator. Currently your query is searching for an exact match of "A&B*", i.e. it expects the asterisk to be part of your data.
And when you search "A&B *", the whitespace is a reserved character, so it is now searching for "A&B" (or) "*", and hence you get a match in this case.
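For reference, a sketch of the answer's suggestion dropped into a minimal version of the original query (the fields list is trimmed to just name and the filtered wrapper is omitted for brevity):
GET /_search
{
  "query": {
    "query_string": {
      "query": "A&B\\*",
      "fields": ["name"]
    }
  }
}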
