Elasticsearch does not find characters other than alphanumeric

I am facing a problem searching for characters other than alphanumeric ones.
I have tried many analyzers, but I think the 'whitespace' analyzer fits my problem perfectly.
I've created an index custom_doc and posted a document:
{
  "body": "some text with ### hash signs # inside"
}
but I am not able to find this doc by passing a hash inside a query_string query:
{
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "fields": [
              "body"
            ],
            "query": "#",
            "analyzer": "whitespace"
          }
        }
      ]
    }
  }
}
However, the _analyze API shows the text is tokenized correctly.
Request:
{
  "analyzer": "whitespace",
  "text": "#"
}
Result:
{
  "tokens": [
    {
      "token": "#",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    }
  ]
}
There are no custom analyzers, no mappings, and no additional filters.
How can I solve this? I've checked many similar questions with no improvement. Some people advise making the field "not_analyzed", but I still want to be able to use wildcards inside the query string, so changing the field type from "text" to "keyword" is not suitable for me either. E.g. I want the query "so*" to return the posted document.

The problem is that you also need to specify the whitespace analyzer at indexing time. Using it only at search time is not sufficient: your body field has already been analyzed by the standard analyzer, which removed the # signs, so you cannot search for them afterwards.
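You can verify this by running the text through the standard analyzer:
POST _analyze
{
  "analyzer": "standard",
  "text": "some text with ### hash signs # inside"
}
This yields only the tokens some, text, with, hash, signs and inside; the # characters never make it into the index, so no search-time analyzer can bring them back.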
First delete your index and recreate it with the following mapping:
DELETE index
PUT index
{
  "mappings": {
    "doc": {
      "properties": {
        "body": {
          "type": "text",
          "analyzer": "whitespace",
          "search_analyzer": "whitespace"
        }
      }
    }
  }
}
Then index your document:
PUT index/doc/1
{ "body": "some text with ### hash signs # inside"}
Finally, you can search for the # sign (note that you don't need to specify the whitespace analyzer):
POST index/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "fields": [
              "body"
            ],
            "query": "#"
          }
        }
      ]
    }
  }
}
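And since the field is still of type text, wildcard searches keep working, so the "so*" query you asked about should also return the document:
POST index/_search
{
  "query": {
    "query_string": {
      "fields": ["body"],
      "query": "so*"
    }
  }
}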

Related

How to search using words written together when the data has those words written apart in Elasticsearch?

I have documents which have, let's say, one field: the name of the document. A name may consist of several words written apart, for example:
{
  "name": "first document"
},
{
  "name": "second document"
}
My goal is to be able to search for these documents by strings:
firstdocument, seconddocumen
As you can see, the search strings are misspelled, but they would still match those documents if we removed the whitespace from the documents' names. This could be handled by creating another field with the same string but without whitespace, but that seems like extra data unless there is no other way to do it.
I need something similar to this:
GET /_analyze
{
"tokenizer": "whitespace",
"filter": [
{
"type":"shingle",
"max_shingle_size":3,
"min_shingle_size":2,
"output_unigrams":"true",
"token_separator": ""
}
],
"text": "first document"
}
But the other way around: I need to apply this not to the search text but to the indexed objects (the documents' names), so that I can find documents even with a small misspelling in the search text. How should it be done?
I suggest using multi-fields with an analyzer that removes whitespace.
Analyzer
"no_spaces": {
"filter": [
"lowercase"
],
"char_filter": [
"remove_spaces"
],
"tokenizer": "standard"
}
Char Filter
"remove_spaces": {
"type": "pattern_replace",
"pattern": "[ ]",
"replacement": ""
}
Field Mapping
"name": {
"type": "text",
"fields": {
"without_spaces": {
"type": "text",
"analyzer": "no_spaces"
}
}
}
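For reference, wiring these snippets together into a single index-creation request could look like this (a sketch; the index name my_index is a placeholder, and the typeless mapping assumes Elasticsearch 7.x or later):
PUT /my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "remove_spaces": {
          "type": "pattern_replace",
          "pattern": "[ ]",
          "replacement": ""
        }
      },
      "analyzer": {
        "no_spaces": {
          "tokenizer": "standard",
          "char_filter": ["remove_spaces"],
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "fields": {
          "without_spaces": {
            "type": "text",
            "analyzer": "no_spaces"
          }
        }
      }
    }
  }
}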
Query
GET /_search
{
  "query": {
    "match": {
      "name.without_spaces": {
        "query": "seconddocumen",
        "fuzziness": "AUTO"
      }
    }
  }
}
EDIT:
For completeness: an alternative to the remove_spaces char filter could be the shingle filter:
"analysis": {
"filter": {
"shingle_filter": {
"type": "shingle",
"output_unigrams": "false",
"token_separator": ""
}
},
"analyzer": {
"shingle_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"shingle_filter"
]
}
}
}
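Assuming the analysis settings above are part of an index called my_index (a placeholder name), you can check what the shingle analyzer produces:
GET /my_index/_analyze
{
  "analyzer": "shingle_analyzer",
  "text": "first document"
}
With output_unigrams set to false and an empty token_separator, this should produce the single token firstdocument, which is exactly what the fuzzy match query needs.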

How can I get auto-suggestions for synonym matches in Elasticsearch

I'm using the code below, and it does not give "curd" as an auto-suggestion when I type "cu".
But it does match the document with "yogurt", which is correct.
How can I get both auto-completion for synonym words and document matching for them?
PUT products
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "synonym_analyzer": {
            "tokenizer": "standard",
            "filter": [
              "lowercase",
              "synonym_graph"
            ]
          }
        },
        "filter": {
          "synonym_graph": {
            "type": "synonym_graph",
            "synonyms": [
              "yogurt, curd, dahi"
            ]
          }
        }
      }
    }
  }
}
PUT products/_mapping
{
  "properties": {
    "description": {
      "type": "text",
      "analyzer": "synonym_analyzer"
    }
  }
}
POST products/_doc
{
  "description": "yogurt"
}
GET products/_search
{
  "query": {
    "match": {
      "description": "cu"
    }
  }
}
When you provide a list of synonyms in a synonym_graph filter it simply means that ES will treat any of the synonyms interchangeably. But when they're analyzed via the standard analyzer, only full-word tokens will be produced:
POST products/_analyze?filter_path=tokens.token
{
  "text": "yogurt",
  "field": "description"
}
yielding:
{
  "tokens" : [
    {
      "token" : "curd"
    },
    {
      "token" : "dahi"
    },
    {
      "token" : "yogurt"
    }
  ]
}
As such, a regular match query won't cut it here, because the analysis above hasn't provided it with any matchable substrings (n-grams).
In the meantime, you can replace match with match_phrase_prefix, which does exactly what you're after: it matches an ordered sequence of characters while taking the synonyms into account:
GET products/_search
{
  "query": {
    "match_phrase_prefix": {
      "description": "cu"
    }
  }
}
But that, as the query name suggests, is only going to work for prefixes. If you fancy an autocomplete that suggests terms regardless of where the substring matches occur, have a look at my other answer where I talk about leveraging n-grams.
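For reference, that kind of n-gram setup could look roughly like the sketch below. It is not taken from the linked answer; the index name products_ngram, the filter names and the gram sizes are illustrative assumptions, and index.max_ngram_diff has to be raised to allow the gram-size spread:
PUT products_ngram
{
  "settings": {
    "index": {
      "max_ngram_diff": 10,
      "analysis": {
        "filter": {
          "synonyms_at_index": {
            "type": "synonym",
            "synonyms": [
              "yogurt, curd, dahi"
            ]
          },
          "substring_ngrams": {
            "type": "ngram",
            "min_gram": 2,
            "max_gram": 10
          }
        },
        "analyzer": {
          "synonym_ngram_analyzer": {
            "tokenizer": "standard",
            "filter": [
              "lowercase",
              "synonyms_at_index",
              "substring_ngrams"
            ]
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "description": {
        "type": "text",
        "analyzer": "synonym_ngram_analyzer",
        "search_analyzer": "standard"
      }
    }
  }
}
With this, even a plain match query on description for "cu" should find the yogurt document, because the synonym-expanded terms are broken into substrings at index time.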

Elasticsearch synonyms that include spaces, commas and parentheses

I'm attempting to configure Elasticsearch (version 6.4) so it's possible to do full text search on documents that may contain chemical names using a number of chemical synonyms. The synonym terms can:
be multi-word (i.e. contain spaces)
contain hyphens
contain parentheses
contain commas
Can anyone help me come up with a configuration that meets these requirements?
The index config I have at the moment looks like this:
PUT /documents
{
  "settings": {
    "analysis": {
      "analyzer": {
        "chemical_synonyms": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["lowercase", "chem_synonyms"]
        },
        "lower": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["lowercase"]
        }
      },
      "filter": {
        "chem_synonyms": {
          "type": "synonym_graph",
          "synonyms": [
            "N\\,N-Bis(2-hydroxyethyl)amine, Niax DEOA-LF, 111-42-2"
          ]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "text": {
          "type": "text",
          "fields": {
            "english": {
              "type": "text",
              "analyzer": "english"
            },
            "raw": {
              "type": "text",
              "analyzer": "lower"
            }
          }
        }
      }
    }
  }
}
This config contains a single line of Solr-style synonyms. In reality there are more and they come from a file, but the gist is the same.
Assume I have three documents:
PUT /documents/doc/1
{"text": "N,N-Bis(2-hydroxyethyl)amine"}
PUT /documents/doc/2
{"text": "Niax DEOA-LF"}
PUT /documents/doc/3
{"text": "111-42-2"}
If I run a search using this config:
POST /documents/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "query_string": {
            "default_operator": "AND",
            "type": "cross_fields",
            "query": "\"N,N-Bis(2-hydroxyethyl)amine\""
          }
        },
        {
          "query_string": {
            "default_operator": "AND",
            "default_field": "*.raw",
            "analyzer": "chemical_synonyms",
            "query": "\"N,N-Bis(2-hydroxyethyl)amine\""
          }
        }
      ]
    }
  }
}
I would expect it to match all three documents; however, it's currently not matching document 2. Changing the query to "111-42-2" also fails to match document 2. Searching for "Niax DEOA-LF" correctly matches all three.
How can I change either my index config or my search query (or both) so that a search for any one of these synonym terms will match all documents that contain any other of the synonym terms? Normal full-text searching must also continue to work, so any changes can't prevent standard text searching of non-synonym terms from working.

Elasticsearch query returning false results when term exceeds ngram length

The requirement is to search for partial phrases in a block of text. Most of the words will be of standard length, and I want to keep the max_gram value down to 10. But there may be the occasional id/code with more characters than that, and such documents show up if I type a query where the first 10 characters match but the rest don't.
For example, here is the mapping:
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "autocomplete": {
          "tokenizer": "autocomplete",
          "filter": [
            "lowercase"
          ]
        }
      },
      "tokenizer": {
        "autocomplete": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 10,
          "token_chars": [
            "letter"
          ]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "autocomplete"
        }
      }
    }
  }
}
and document:
POST my_index/doc/1
{
  "title": "Quick fox with id of ABCDEFGHIJKLMNOP"
}
If I run the query:
POST my_index/doc/_search
{
  "query": {
    "match_phrase": {
      "title": {
        "query": "fox wi"
      }
    }
  }
}
It returns the document as expected. However, if I run this:
POST my_index/doc/_search
{
  "query": {
    "match_phrase": {
      "title": {
        "query": "ABCDEFGHIJxxx"
      }
    }
  }
}
It also returns the document, when it shouldn't. It will do this if the x's are after the 10th character, but not before it. How can I avoid this?
I am using version 5.
By default, the analyzer that is used at index time is the same analyzer that is used at search time, meaning the edge_ngram analyzer is used on your search term. This is not what you want. You will end up with 10 tokens as the search terms, none of which contain those last 3 characters.
You will want to take a look at the Search Analyzer for your mapping. This documentation points out this specific use case:
Sometimes, though, it can make sense to use a different analyzer at search time, such as when using the edge_ngram tokenizer for autocomplete.
The standard analyzer may suit your needs:
{
  ...
  "mappings": {
    "doc": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "autocomplete",
          "search_analyzer": "standard"
        }
      }
    }
  }
}
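To see why this fixes the problem, you can compare how the search term is analyzed once the standard search analyzer is in place:
POST my_index/_analyze
{
  "analyzer": "standard",
  "text": "ABCDEFGHIJxxx"
}
This returns the single token abcdefghijxxx; since only edge n-grams of up to 10 characters (a through abcdefghij) were indexed for that id, the phrase query no longer finds a match.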

Ignore leading zeros with Elasticsearch

I am trying to create a search bar where the most common query will be for a "serviceOrderNo". "serviceOrderNo" is not a number field in the database; it is a string field. Examples:
000000007
000000002
WO0000042
123456789
AllTextss
000000054
000000065
000000874
The most common format is just an integer preceded by some number of zeros.
How do I set up Elasticsearch so that searching for "65" will match "000000065"? I also want to give precedence to the "serviceOrderNo" field (which I already have working). Here is where I am at right now:
{
  "query": {
    "multi_match": {
      "query": "65",
      "fields": ["serviceOrderNo^2", "_all"]
    }
  }
}
One way of doing this is to use the Lucene-flavour regular expression query:
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-regexp-query.html
"query": {
"regexp":{
"serviceOrderNo": "[0]*65"
}
}
Also, the query_string query supports a small set of special characters and a more limited set of regular expression characters, as well as full Lucene regular expressions; the query would look like this:
https://www.elastic.co/guide/en/elasticsearch/reference/1.x/query-dsl-query-string-query.html
"query": {
"query_string": {
"default_field": "serviceOrderNo",
"query": "0*65"
}
}
These are fairly simple regular expressions; both say: match the character contained in the brackets ([0]), or the character 0, repeated any number of times (*), followed by 65.
If you have the ability to reindex, or haven't indexed your data yet, you can also make this easier on yourself by writing a custom analyzer. Right now, you are using the default analyzer for string fields on your serviceOrderNo field. When you index "serviceOrderNo": "00000065", ES interprets this simply as 00000065.
Your custom analyzer could tokenize this field into both "00000065" and "65", using the same regular expression. The benefit of this is that the regex only runs once, at index time, instead of every time you run your query, because ES will search against both "00000065" and "65".
You can also check out the ES website documentation on Analyzers.
"settings":{
"analysis": {
"filter":{
"trimZero": {
"type":"pattern_capture",
"patterns":"^0*([0-9]*$)"
}
},
"analyzer": {
"serviceOrderNo":{
"type":"custom",
"tokenizer":"standard",
"filter":"trimZero"
}
}
}
},
"mappings":{
"serviceorderdto": {
"properties":{
"serviceOrderNo":{
"type":"String",
"analyzer":"serviceOrderNo"
}
}
}
}
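Once an index (say my_index, a placeholder name) exists with these settings, you can check the filter's behaviour with the _analyze API (on older versions the analyzer and text may need to be passed as URL parameters instead of a JSON body):
POST my_index/_analyze
{
  "analyzer": "serviceOrderNo",
  "text": "000000065"
}
This should return both 000000065 and 65 as tokens, since pattern_capture keeps the original token by default alongside the captured group, so a query for either form will match.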
One way to do this is to use an ngram token filter so that "12345" gets tokenized as:
[ 1, 2, 3, 4, 5 ]
[ 12, 23, 34, 45 ]
[ 123, 234, 345 ]
[ 12345 ]
When tokenized this way, "65" is a match for "000000065".
To set this up, create a new index that has a custom analyzer that uses an ngram filter:
POST /my-index
{
  "mappings": {
    "serviceorderdto": {
      "properties": {
        "serviceOrderNo": {
          "type": "string",
          "analyzer": "autocomplete"
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "ngram",
          "min_gram": 1,
          "max_gram": 20
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "autocomplete_filter"
          ]
        }
      }
    }
  }
}
Index some data.
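For example, using one of the sample values from the question:
PUT /my-index/serviceorderdto/1
{
  "serviceOrderNo": "000000065"
}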
Then run your query:
GET /my-index/_search
{
  "query": {
    "multi_match": {
      "query": "65",
      "fields": [
        "serviceOrderNo^2",
        "_all"
      ]
    }
  }
}
