Ignore leading zeros with Elasticsearch - elasticsearch

I am trying to create a search bar where the most common query will be for a "serviceOrderNo". "serviceOrderNo" is not a number field in the database, it is a string field. Examples:
000000007
000000002
WO0000042
123456789
AllTextss
000000054
000000065
000000874
The most common format is just an integer preceded by some number of zeros.
How do I set up Elasticsearch so that searching for "65" will match "000000065"? I also want to give precedence to the "serviceOrderNo" field (which I already have working). Here is where I am at right now:
{
"query": {
"multi_match": {
"query": "65",
"fields": ["serviceOrderNo^2", "_all"]
}
}
}

One way of doing this is to use the Lucene-flavoured regexp query:
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-regexp-query.html
"query": {
"regexp":{
"serviceOrderNo": "[0]*65"
}
}
The query_string query also supports a small set of special characters (a more limited wildcard syntax) as well as full Lucene regular expressions; the query would look like this:
https://www.elastic.co/guide/en/elasticsearch/reference/1.x/query-dsl-query-string-query.html
"query": {
"query_string": {
"default_field": "serviceOrderNo",
"query": "0*65"
}
}
These are fairly simple patterns: both say to match the character(s) in the brackets [0], or the literal character 0, an unlimited number of times (the *).
If you have the ability to reindex, or haven't indexed your data yet, you can also make this easier on yourself by writing a custom analyzer. Right now, you are using the default analyzer for strings on your serviceOrderNo field. When you index "serviceOrderNo": "000000065", ES interprets this simply as the single token 000000065.
Your custom analyzer could tokenize this field into both "000000065" and "65", using the same regular expression. The benefit is that the regex runs only once, at index time, instead of on every query, because ES will then search against both "000000065" and "65".
You can also check out the ES website documentation on Analyzers.
"settings":{
"analysis": {
"filter":{
"trimZero": {
"type":"pattern_capture",
"patterns":"^0*([0-9]*$)"
}
},
"analyzer": {
"serviceOrderNo":{
"type":"custom",
"tokenizer":"standard",
"filter":"trimZero"
}
}
}
},
"mappings":{
"serviceorderdto": {
"properties":{
"serviceOrderNo":{
"type":"String",
"analyzer":"serviceOrderNo"
}
}
}
}
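Once the index is created, you can sanity-check the filter by running a sample value through the analyzer with the _analyze API (request-body syntax as in 5.x+; the index name serviceorders is just a placeholder). pattern_capture keeps the original token by default, so you should see both "000000065" and "65":
GET /serviceorders/_analyze
{
"analyzer": "serviceOrderNo",
"text": "000000065"
}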

One way to do this is to use an ngram token filter so that "12345" gets tokenized as:
[ 1, 2, 3, 4, 5 ]
[ 12, 23, 34, 45 ]
[ 123, 234, 345 ]
[ 1234, 2345 ]
[ 12345 ]
When tokenized this way, "65" is a match for "000000065".
To set this up, create a new index that has a custom analyzer that uses an ngram filter:
POST /my-index
{
"mappings": {
"serviceorderdto": {
"properties": {
"serviceOrderNo": {
"type": "string",
"analyzer": "autocomplete"
}
}
}
},
"settings": {
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "ngram",
"min_gram": 1,
"max_gram": 20
}
},
"analyzer": {
"autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"autocomplete_filter"
]
}
}
}
}
}
Index some data.
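For example (the document ID and value here are only placeholders):
PUT /my-index/serviceorderdto/1
{
"serviceOrderNo": "000000065"
}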
Then run your query:
GET /my-index/_search
{
"query": {
"multi_match": {
"query": "55",
"fields": [
"serviceOrderNo^2",
"_all"
]
}
}
}
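One caveat with this mapping: because no separate search analyzer is set, the query string is also run through the ngram filter at search time, which can make matches looser than expected. If that becomes an issue, you can point the field at a plain analyzer for searching, e.g. (a sketch reusing the field and analyzer names from the mapping above):
"serviceOrderNo": {
"type": "string",
"analyzer": "autocomplete",
"search_analyzer": "standard"
}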

Related

Elastic search with fuzziness more than 2 characters (Distance)

I am trying to match text fields and expect results when 60% or more of the text matches.
With fuzziness I can only allow an edit distance of 2.
The index has a record with description 'theeventsfooddrinks' and I am trying to match 'theeventsfooddrinks123', which does not match.
'theeventsfooddrinks12' => matches
'theeventsfooddri' => doesn't match
'321eventsfooddrinks' => doesn't match
I want Elasticsearch to match 'eventsfooddrinks'.
Any change requiring more than 2 edits does not match.
I think fuzzy queries are inappropriate for your case. Fuzziness is a way to handle the small misspellings a human can make while typing a query: the brain easily skips a substituted letter in the middle of a word without losing the overall meaning of the phrase, and we expect similar behaviour from a search engine.
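For reference, the fuzziness the question relies on is capped by Lucene at an edit distance of 2, so no setting will make 'theeventsfooddrinks123' match; a sketch of that kind of query, assuming a field called description:
GET my_index/_search
{
"query": {
"match": {
"description": {
"query": "theeventsfooddrinks123",
"fuzziness": 2
}
}
}
}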
Try regular partial matching with an ngram analyzer instead:
PUT my_index
{
"settings": {
"analysis": {
"filter": {
"trigrams_filter": {
"type": "ngram",
"min_gram": 3,
"max_gram": 3
}
},
"analyzer": {
"trigrams": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"trigrams_filter"
]
}
}
}
},
"mappings": {
"my_type": {
"properties": {
"my_field": {
"type": "text",
"analyzer": "trigrams"
}
}
}
}
}
GET my_index/my_type/_search
{
"query": {
"match": {
"my_field": {
"query": "eventsfooddrinks",
"minimum_should_match": "60%"
}
}
}
}
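To see what the 60% is measured against, you can inspect the trigrams the query string is broken into (5.x+ _analyze request-body syntax, using the index defined above):
GET my_index/_analyze
{
"analyzer": "trigrams",
"text": "eventsfooddrinks"
}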

Elasticsearch does not find characters other than alpha numeric

I am facing a problem searching for characters other than alphanumeric ones.
I have tried many analyzers, but I think the 'whitespace' analyzer fits my problem perfectly.
I've created an index custom_doc and posted a doc
{
"body": "some text with ### hash signs # inside",
}
but I am not able to find this doc by passing a hash in the query string:
{
"query": {
"bool": {
"must": [
{
"query_string": {
"fields": [
"body"
],
"query": "#",
"analyzer": "whitespace"
}
}
]
}
}
}
However, the _analyze API shows that it is tokenized correctly.
request (GET _analyze):
{
"analyzer": "whitespace",
"text": "#"
}
result
{
"tokens": [
{
"token": "#",
"start_offset": 0,
"end_offset": 1,
"type": "word",
"position": 0
}
]
}
There are no custom analyzers, no mappings and no additional filters.
How can I solve the problem? I've checked many similar questions with no improvement. Some people advise making the field "not_analyzed", but I still want to be able to use wildcards in the query string, so changing the field type from "text" to "keyword" is not suitable either. E.g. I want the query "so*" to return the posted document.
The problem is that you also need to specify the whitespace analyzer at indexing time. Using it only at search time is not sufficient, because your body of text will have been analyzed by the standard analyzer which has removed the # signs, and thus, you cannot search for them afterwards.
First delete your index and recreate it with the following mapping:
DELETE index
PUT index
{
"mappings": {
"doc": {
"properties": {
"body": {
"type": "text",
"analyzer": "whitespace",
"search_analyzer": "whitespace"
}
}
}
}
}
Then index your document:
PUT index/doc/1
{ "body": "some text with ### hash signs # inside"}
Finally, you can search for the # sign (note that you don't need to specify the whitespace analyzer):
POST index/_search
{
"query": {
"bool": {
"must": [
{
"query_string": {
"fields": [
"body"
],
"query": "#"
}
}
]
}
}
}
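For the wildcard requirement mentioned in the question, nothing extra should be needed: the whitespace analyzer keeps "some" as a single token, so a query_string wildcard like the one below should still return the document (a quick sketch against the same index):
POST index/_search
{
"query": {
"query_string": {
"fields": ["body"],
"query": "so*"
}
}
}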

Elasticsearch query returning false results when term exceeds ngram length

The requirement is to search partial phrases in a block of text. Most of the words will be standard length. I want to keep the max_gram value down to 10. But there may be the occasional id/code with more characters than that, and these show up if I type in a query where the first 10 characters match, but then the rest don't.
For example, here is the mapping:
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"autocomplete": {
"tokenizer": "autocomplete",
"filter": [
"lowercase"
]
}
},
"tokenizer": {
"autocomplete": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 10,
"token_chars": [
"letter"
]
}
}
}
},
"mappings": {
"doc": {
"properties": {
"title": {
"type": "text",
"analyzer": "autocomplete"
}
}
}
}
}
and document:
POST my_index/doc/1
{
"title": "Quick fox with id of ABCDEFGHIJKLMNOP"
}
If I run the query:
POST my_index/doc/_search
{
"query": {
"match_phrase": {
"title": {
"query": "fox wi"
}
}
}
}
It returns the document as expected. However, if I run this:
POST my_index/doc/_search
{
"query": {
"match_phrase": {
"title": {
"query": "ABCDEFGHIJxxx"
}
}
}
}
It also returns the document, when it shouldn't. It will do this if the x's are after the 10th character, but not before it. How can I avoid this?
I am using version 5.
By default, the analyzer that is used at index time is the same analyzer that is used at search time, meaning the edge_ngram analyzer is used on your search term. This is not what you want. You will end up with 10 tokens as the search terms, none of which contain those last 3 characters.
You will want to take a look at the Search Analyzer for your mapping. This documentation points out this specific use case:
Sometimes, though, it can make sense to use a different analyzer at search time, such as when using the edge_ngram tokenizer for autocomplete.
The standard analyzer may suit your needs:
{
...
"mappings": {
"doc": {
"properties": {
"title": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "standard"
}
}
}
}
}
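If you want to confirm the difference, you can run the search term through both analyzers with the _analyze API (assuming the my_index name from the question): the autocomplete analyzer breaks "ABCDEFGHIJxxx" into edge ngrams a through abcdefghij, while the standard analyzer keeps it as the single token abcdefghijxxx, which no longer matches the indexed grams.
GET my_index/_analyze
{
"analyzer": "standard",
"text": "ABCDEFGHIJxxx"
}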

Elastic search query from SQL statement with multiple WHERE clause

I need an Elasticsearch query based on the following SQL statement:
SELECT * FROM documents
WHERE (doc_name like "%test%" OR doc_type like "%test%" OR doc_desc like "%test%") AND
user_id = 1 AND doc_category = "Utilities"
It depends on your mapping, but you can start working with something like this:
"query": {
"filtered": {
"filter": {
"bool": {
"must": [
{
"term": {
"user_id": 1
}
},
{
"term": {
"doc_category": "Utilities"
}
}
]
}
},
"query": {
"multi_match": {
"query": "test",
"fields": ["doc_name", "doc_type", "doc_desc"]
}
}
}
}
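Note that the filtered query above is the old 1.x syntax; it was deprecated in 2.0 and removed in 5.0. On newer versions the same idea can be expressed with a bool query. A rough sketch (it assumes user_id and doc_category are not_analyzed/keyword fields so the term filters match exactly):
"query": {
"bool": {
"filter": [
{ "term": { "user_id": 1 } },
{ "term": { "doc_category": "Utilities" } }
],
"must": {
"multi_match": {
"query": "test",
"fields": ["doc_name", "doc_type", "doc_desc"]
}
}
}
}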
Adding to the answer given by jbasko: doing LIKE queries in Elasticsearch very much depends on the mapping of your document fields. For example, if you want the equivalent of LIKE '%test%' in Elasticsearch, you need to use the ngram tokenizer:
{
"settings": {
"analysis": {
"analyzer": {
"some_analyzer_name": {
"tokenizer": "some_tokenizer_name"
}
},
"tokenizer": {
"some_tokenizer_name": {
"type": "ngram",
"min_gram": <minimum number of characters>,
"max_gram": <maximum number of characters>,
"token_chars": [
"letter",
"digit"
]
}
}
}
...
and in the mapping of the fields use the analyzer:
"mapping":{
...
"doc_type" : {
"type" :"string",
"analyzer" : "some_analyzer_name"
},
...
"doc_type" : {
"type" :"string",
"analyzer" : "some_analyzer_name"
},
...
}
A short explanation of ngram: this tokenizer breaks the string in doc_type and the other fields into short consecutive substrings of the lengths you define in the settings.
For example, an ngram with
min_gram : 1
max_gram : 3
on the string "abcd"
gives you the collection of terms 'a', 'ab', 'abc', 'b', 'bc', 'bcd', 'c', 'cd', 'd'. These terms are what Elasticsearch matches against in its inverted index when you run a match (or multi_match) query.
For further reading, look up mapping, the ngram tokenizer and term vectors in the Elasticsearch documentation.
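To see these terms for yourself once the index exists, you can run a sample string through the analyzer with the _analyze API (request-body syntax as in 5.x+; the index name documents is just a placeholder):
GET /documents/_analyze
{
"analyzer": "some_analyzer_name",
"text": "abcd"
}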

Autocomplete functionality using elastic search

I have an Elasticsearch index with the following documents and I want to have autocomplete functionality over the specified fields:
mapping: https://gist.github.com/anonymous/0609b1d110d91dceb9a90faa76d1d5d4
Usecase:
My query is a prefix type, e.g. "sta", "star", "star w" ... "star war", etc., with an additional filter of tags = "science fiction". These queries could also match other fields like description and actors (in the cast field; note this is nested). I also want to know which field was matched.
I investigated two ways of doing this, but neither method seems to address the use case above:
1) Suggester autocomplete:
https://www.elastic.co/guide/en/elasticsearch/reference/1.7/search-suggesters-completion.html
With this it seems I have to add another field called "suggest", replicating the data, which is not desirable.
2) using a prefix filter/query:
https://www.elastic.co/guide/en/elasticsearch/reference/1.7/query-dsl-prefix-filter.html
this gives the whole document back, not the exact matching terms.
Is there a clean way of achieving this? Please advise.
Don't create the mapping separately; insert data directly into the index and it will create a default mapping for you. Use the query below for autocomplete.
GET /netflix/movie/_search
{
"query": {
"query_string": {
"query": "sta*"
}
}
}
I think the completion suggester would be the cleanest way, but if that is undesirable you could use aggregations on the name field.
Here is a sample index (I am assuming from your question that you are using ES 1.7):
PUT netflix
{
"settings": {
"analysis": {
"analyzer": {
"prefix_analyzer": {
"tokenizer": "keyword",
"filter": [
"lowercase",
"trim",
"edge_filter"
]
},
"keyword_analyzer": {
"tokenizer": "keyword",
"filter": [
"lowercase",
"trim"
]
}
},
"filter": {
"edge_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 20
}
}
}
},
"mappings": {
"movie":{
"properties": {
"name":{
"type": "string",
"fields": {
"prefix":{
"type":"string",
"index_analyzer" : "prefix_analyzer",
"search_analyzer" : "keyword_analyzer"
},
"raw":{
"type": "string",
"analyzer": "keyword_analyzer"
}
}
},
"tags":{
"type": "string", "index": "not_analyzed"
}
}
}
}
}
Using multi-fields, the name field is analyzed in different ways. name.prefix uses the keyword tokenizer with an edge ngram filter, so that the string "star wars" is broken into s, st, sta, etc. At search time keyword_analyzer is used instead, so that the search query does not get broken into multiple small tokens. name.raw is used for the aggregation.
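A couple of sample documents, just hypothetical data so the aggregation output below has something to work with:
PUT netflix/movie/1
{
"name": "star wars",
"tags": "sci-fi"
}
PUT netflix/movie/2
{
"name": "star trek",
"tags": "sci-fi"
}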
The following query will give top 10 suggestions.
GET netflix/movie/_search
{
"query": {
"filtered": {
"filter": {
"term": {
"tags": "sci-fi"
}
},
"query": {
"match": {
"name.prefix": "sta"
}
}
}
},
"size": 0,
"aggs": {
"unique_movie_name": {
"terms": {
"field": "name.raw",
"size": 10
}
}
}
}
Results will be something like
"aggregations": {
"unique_movie_name": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "star trek",
"doc_count": 1
},
{
"key": "star wars",
"doc_count": 1
}
]
}
}
UPDATE:
I think you could use highlighting for this purpose. The highlight section will give you the whole matching word and the field it matched in. You can also use inner hits with highlighting inside them to get nested docs as well.
{
"query": {
"query_string": {
"query": "sta*"
}
},
"_source": false,
"highlight": {
"fields": {
"*": {}
}
}
}
