We are running Elasticsearch 1.7 (planning to upgrade very soon) and I am trying to use the Analyze API to understand what the different analyzers do, but the result returned by elasticsearch is not what I expect.
If I run the following query against our elasticsearch instance
GET _analyze
{
"analyzer": "stop",
"text": "Extremely good food! We had the happiest waiter and the crowd's always flowing!"
}
I will get this result
{
"tokens": [
{
"token": "analyzer",
"start_offset": 6,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "stop",
"start_offset": 18,
"end_offset": 22,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "text",
"start_offset": 30,
"end_offset": 34,
"type": "<ALPHANUM>",
"position": 3
},
{
"token": "extremely",
"start_offset": 38,
"end_offset": 47,
"type": "<ALPHANUM>",
"position": 4
},
{
"token": "good",
"start_offset": 48,
"end_offset": 52,
"type": "<ALPHANUM>",
"position": 5
},
{
"token": "food",
"start_offset": 53,
"end_offset": 57,
"type": "<ALPHANUM>",
"position": 6
},
{
"token": "we",
"start_offset": 59,
"end_offset": 61,
"type": "<ALPHANUM>",
"position": 7
},
{
"token": "had",
"start_offset": 62,
"end_offset": 65,
"type": "<ALPHANUM>",
"position": 8
},
{
"token": "the",
"start_offset": 66,
"end_offset": 69,
"type": "<ALPHANUM>",
"position": 9
},
{
"token": "happiest",
"start_offset": 70,
"end_offset": 78,
"type": "<ALPHANUM>",
"position": 10
},
{
"token": "waiter",
"start_offset": 79,
"end_offset": 85,
"type": "<ALPHANUM>",
"position": 11
},
{
"token": "and",
"start_offset": 86,
"end_offset": 89,
"type": "<ALPHANUM>",
"position": 12
},
{
"token": "the",
"start_offset": 90,
"end_offset": 93,
"type": "<ALPHANUM>",
"position": 13
},
{
"token": "crowd's",
"start_offset": 94,
"end_offset": 101,
"type": "<ALPHANUM>",
"position": 14
},
{
"token": "always",
"start_offset": 102,
"end_offset": 108,
"type": "<ALPHANUM>",
"position": 15
},
{
"token": "flowing",
"start_offset": 109,
"end_offset": 116,
"type": "<ALPHANUM>",
"position": 16
}
]
}
which does not make sense to me. I am using the stop analyzer, so why are the words "and" and "the" in the result? I have tried changing the stop analyzer to both whitespace and standard, but I get the exact same result as above; there is no difference between them.
However, if I run the exact same query against an Elasticsearch 5.x instance, the result no longer contains "and" and "the" and looks much more like what I would expect.
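For reference, on 5.x the stop analyzer (a letter-based, lowercasing tokenizer followed by the default English stop word filter) gives me roughly this token stream for that sentence; note that "we" and "had" are not in the default English stop word list as far as I can tell, and "crowd's" is split on the apostrophe because the tokenizer only keeps letters:
extremely, good, food, we, had, happiest, waiter, crowd, s, always, flowing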
Is this because we are using 1.7 or is it something in our setup of Elasticsearch that is causing this issue?
edit:
I am using the Sense plugin in Chrome to run my queries, and the plugin does not support GET with a request body, so it changes the request to a POST. The Analyze API in Elasticsearch 1.7 does not seem to support POST requests :( If I change the query to GET _analyze?analyzer=stop&text=THIS+is+a+test&pretty it works.
In 1.x the syntax is different from 2.x and 5.x. According to the 1.x documentation, you should be using the _analyze API like this:
GET _analyze?analyzer=stop
{
"text": "Extremely good food! We had the happiest waiter and the crowd's always flowing!"
}
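If your client rewrites a GET with a body into a POST (as Sense does), you can also put everything in the query string and call it with curl; a minimal sketch, assuming the node is reachable on localhost:9200:
curl -XGET 'http://localhost:9200/_analyze?analyzer=stop&text=Extremely+good+food&pretty'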
Related
I know that elasticsearch's standard analyzer uses the standard tokenizer to generate tokens.
The elasticsearch docs say it does grammar-based tokenization, but the separators used by the standard tokenizer are not clear.
My use case is as follows:
In my elasticsearch index I have some fields which use the default standard analyzer.
In those fields I want the # character to be searchable and . to act as one more separator.
Can I achieve my use case with the standard analyzer?
I checked which tokens it generates for the string hey john.s #100 is a test name.
POST _analyze
{
"text": "hey john.s #100 is a test name",
"analyzer": "standard"
}
It generated the following tokens
{
"tokens": [
{
"token": "hey",
"start_offset": 0,
"end_offset": 3,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "john.s",
"start_offset": 4,
"end_offset": 10,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "100",
"start_offset": 12,
"end_offset": 15,
"type": "<NUM>",
"position": 2
},
{
"token": "is",
"start_offset": 16,
"end_offset": 18,
"type": "<ALPHANUM>",
"position": 3
},
{
"token": "a",
"start_offset": 19,
"end_offset": 20,
"type": "<ALPHANUM>",
"position": 4
},
{
"token": "test",
"start_offset": 21,
"end_offset": 25,
"type": "<ALPHANUM>",
"position": 5
},
{
"token": "name",
"start_offset": 26,
"end_offset": 30,
"type": "<ALPHANUM>",
"position": 6
}
]
}
So now I have a doubt: is only whitespace used as a separator in the standard tokenizer?
Thank you in advance.
Let's first see why it is not breaking tokens on . for some of the words:
The standard analyzer does use the standard tokenizer, but the standard tokenizer provides grammar-based tokenization based on the Unicode Text Segmentation algorithm. You can read more about the algorithm here, here and here. It is not using the whitespace tokenizer.
Now let's see how you can tokenize on . (dot) and not on #:
You can use the Character Group tokenizer and provide the list of characters on which you want to apply tokenization.
POST _analyze
{
"tokenizer": {
"type": "char_group",
"tokenize_on_chars": [
"whitespace",
".",
"\n"
]
},
"text": "hey john.s #100 is a test name"
}
Response:
{
"tokens": [
{
"token": "hey",
"start_offset": 0,
"end_offset": 3,
"type": "word",
"position": 0
},
{
"token": "john",
"start_offset": 4,
"end_offset": 8,
"type": "word",
"position": 1
},
{
"token": "s",
"start_offset": 9,
"end_offset": 10,
"type": "word",
"position": 2
},
{
"token": "#100",
"start_offset": 11,
"end_offset": 15,
"type": "word",
"position": 3
},
{
"token": "is",
"start_offset": 16,
"end_offset": 18,
"type": "word",
"position": 4
},
{
"token": "a",
"start_offset": 19,
"end_offset": 20,
"type": "word",
"position": 5
},
{
"token": "test",
"start_offset": 21,
"end_offset": 25,
"type": "word",
"position": 6
},
{
"token": "name",
"start_offset": 26,
"end_offset": 30,
"type": "word",
"position": 7
}
]
}
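The request above defines the char_group tokenizer inline just to preview the tokens. To use it at index time you would register it inside a custom analyzer in the index settings; a rough sketch (index, tokenizer and analyzer names are placeholders):
PUT my_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_char_group": {
          "type": "char_group",
          "tokenize_on_chars": [
            "whitespace",
            ".",
            "\n"
          ]
        }
      },
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "my_char_group",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  }
}
You would then reference "analyzer": "my_custom_analyzer" on the fields in question in the mapping.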
This is a sample document with the following points:
Pharmaceutical
Marketing
Building –
responsibilities. Â
Mass. – Aug. 13, 2020 –Â
How can I remove the special characters or non-ASCII unicode chars from the content while indexing? I'm using ES 7.x and StormCrawler 1.17.
Looks like an incorrect detection of the charset. You could normalise the content before indexing by writing a custom parse filter and removing the unwanted characters there.
If writing a custom parse filter and normalization looks difficult for you, you can simply add the asciifolding token filter to your analyzer definition, which converts the non-ASCII chars to their ASCII equivalents, as shown below.
POST http://{{hostname}}:{{port}}/_analyze
{
"tokenizer": "standard",
"filter": [
"asciifolding"
],
"text": "Pharmaceutical Marketing Building â responsibilities.  Mass. â Aug. 13, 2020 âÂ"
}
Here are the tokens generated for your text:
{
"tokens": [
{
"token": "Pharmaceutical",
"start_offset": 0,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "Marketing",
"start_offset": 15,
"end_offset": 24,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "Building",
"start_offset": 25,
"end_offset": 33,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "a",
"start_offset": 34,
"end_offset": 35,
"type": "<ALPHANUM>",
"position": 3
},
{
"token": "responsibilities.A",
"start_offset": 36,
"end_offset": 54,
"type": "<ALPHANUM>",
"position": 4
},
{
"token": "A",
"start_offset": 55,
"end_offset": 56,
"type": "<ALPHANUM>",
"position": 5
},
{
"token": "Mass",
"start_offset": 57,
"end_offset": 61,
"type": "<ALPHANUM>",
"position": 6
},
{
"token": "a",
"start_offset": 63,
"end_offset": 64,
"type": "<ALPHANUM>",
"position": 7
},
{
"token": "Aug",
"start_offset": 65,
"end_offset": 68,
"type": "<ALPHANUM>",
"position": 8
},
{
"token": "13",
"start_offset": 70,
"end_offset": 72,
"type": "<NUM>",
"position": 9
},
{
"token": "2020",
"start_offset": 74,
"end_offset": 78,
"type": "<NUM>",
"position": 10
},
{
"token": "aA",
"start_offset": 79,
"end_offset": 81,
"type": "<ALPHANUM>",
"position": 11
}
]
}
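The _analyze call above only previews the result. To apply the folding while indexing, the same filter goes into a custom analyzer in the index settings and is then referenced from the field mapping; something along these lines (index, analyzer and field names are assumptions):
PUT my_content_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "folding_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "folding_analyzer"
      }
    }
  }
}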
I wanted to find out how elasticsearch tokenizes languages other than English, so I tried out the analyze API it provides. But I cannot understand the output at all. Take for example
GET myindex/_analyze?analyzer=hindi&text="में कहता हूँ और तुम सुनना "
Now, in the above text there are 6 words in total, so I expect at most 6 tokens (assuming the text contains no stop words), but the output is something like this
{
"tokens": [
{
"token": "2350",
"start_offset": 3,
"end_offset": 7,
"type": "<NUM>",
"position": 1
},
{
"token": "2375",
"start_offset": 10,
"end_offset": 14,
"type": "<NUM>",
"position": 2
},
{
"token": "2306",
"start_offset": 17,
"end_offset": 21,
"type": "<NUM>",
"position": 3
},
{
"token": "2325",
"start_offset": 25,
"end_offset": 29,
"type": "<NUM>",
"position": 4
},
{
"token": "2361",
"start_offset": 32,
"end_offset": 36,
"type": "<NUM>",
"position": 5
},
{
"token": "2340",
"start_offset": 39,
"end_offset": 43,
"type": "<NUM>",
"position": 6
},
{
"token": "2366",
"start_offset": 46,
"end_offset": 50,
"type": "<NUM>",
"position": 7
},
{
"token": "2361",
"start_offset": 54,
"end_offset": 58,
"type": "<NUM>",
"position": 8
},
{
"token": "2370",
"start_offset": 61,
"end_offset": 65,
"type": "<NUM>",
"position": 9
},
{
"token": "2305",
"start_offset": 68,
"end_offset": 72,
"type": "<NUM>",
"position": 10
},
{
"token": "2324",
"start_offset": 76,
"end_offset": 80,
"type": "<NUM>",
"position": 11
},
{
"token": "2352",
"start_offset": 83,
"end_offset": 87,
"type": "<NUM>",
"position": 12
},
{
"token": "2340",
"start_offset": 91,
"end_offset": 95,
"type": "<NUM>",
"position": 13
},
{
"token": "2369",
"start_offset": 98,
"end_offset": 102,
"type": "<NUM>",
"position": 14
},
{
"token": "2350",
"start_offset": 105,
"end_offset": 109,
"type": "<NUM>",
"position": 15
},
{
"token": "2360",
"start_offset": 113,
"end_offset": 117,
"type": "<NUM>",
"position": 16
},
{
"token": "2369",
"start_offset": 120,
"end_offset": 124,
"type": "<NUM>",
"position": 17
},
{
"token": "2344",
"start_offset": 127,
"end_offset": 131,
"type": "<NUM>",
"position": 18
},
{
"token": "2344",
"start_offset": 134,
"end_offset": 138,
"type": "<NUM>",
"position": 19
},
{
"token": "2366",
"start_offset": 141,
"end_offset": 145,
"type": "<NUM>",
"position": 20
}
]
}
That means instead of six tokens elasticsearch has detected around 20, all of type <NUM> (I don't know what that means).
I am really confused why this is happening. Can someone enlighten me as to what is going on? What am I doing wrong, or where is my understanding lacking?
How are you calling the elasticsearch API? Possibly the Hindi characters are getting messed up by your client: the <NUM> tokens look like decimal Unicode code points (2350 is म, 2325 is क, and so on), which suggests the text reached Elasticsearch as HTML numeric entities such as &#2350; rather than as raw UTF-8.
It works okay for me (at least the Hindi chars are appearing in the result) on Linux with curl:
curl -XPOST 'http://localhost:9200/myindex/_analyze?analyzer=hindi&pretty' -d 'में कहता हूँ और तुम सुनना '
{
"tokens" : [ {
"token" : "कह",
"start_offset" : 4,
"end_offset" : 8,
"type" : "<ALPHANUM>",
"position" : 2
}, {
"token" : "हुं",
"start_offset" : 9,
"end_offset" : 12,
"type" : "<ALPHANUM>",
"position" : 3
}, {
"token" : "तुम",
"start_offset" : 16,
"end_offset" : 19,
"type" : "<ALPHANUM>",
"position" : 5
}, {
"token" : "सुन",
"start_offset" : 20,
"end_offset" : 25,
"type" : "<ALPHANUM>",
"position" : 6
} ]
}
I am trying to use elasticsearch to index some data about research papers, but I am fighting with accents. For instance, if I use:
GET /_analyze?tokenizer=standard&filter=asciifolding&text="Boletínes de investigaciónes" I get
{
"tokens": [
{
"token": "Bolet",
"start_offset": 1,
"end_offset": 6,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "nes",
"start_offset": 7,
"end_offset": 10,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "de",
"start_offset": 11,
"end_offset": 13,
"type": "<ALPHANUM>",
"position": 3
},
{
"token": "investigaci",
"start_offset": 14,
"end_offset": 25,
"type": "<ALPHANUM>",
"position": 4
},
{
"token": "nes",
"start_offset": 26,
"end_offset": 29,
"type": "<ALPHANUM>",
"position": 5
}
]
}
and I would expect to get something like this
{
"tokens": [
{
"token": "Boletines",
"start_offset": 1,
"end_offset": 6,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "de",
"start_offset": 11,
"end_offset": 13,
"type": "<ALPHANUM>",
"position": 3
},
{
"token": "investigacion",
"start_offset": 14,
"end_offset": 25,
"type": "<ALPHANUM>",
"position": 4
}
]
}
What should I do?
To prevent the extra tokens being formed, you need to use an alternative tokenizer, e.g. try the whitespace tokenizer.
Alternatively use a language analyzer and specify the language.
You should use an ASCII folding filter in your analyzer.
For example, the filter changes à to a.
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-asciifolding-tokenfilter.html
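Putting the two suggestions together, and sending the text in the request body (which also avoids any URL-encoding problems with the accented characters), an _analyze call on a recent Elasticsearch would look roughly like this:
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": [
    "asciifolding"
  ],
  "text": "Boletínes de investigaciónes"
}
which should return Boletines, de and investigaciones as whole tokens.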
My team is attempting to index our item information and need a sanity check on what I have created so far. Here is an example of some of the text we need to search on:
AA4VG90EP4DM1/32R-NSF52F001DX-S WITH DAMAN MANIFOLD 0281CS0011 SIEMENS SPEC 74/07104909/10 REV L WOMACK SYSTEMS
As you can see there is a mix of English words and random numbers and letters. After doing some research online I decided to go with a word delimiter filter and a whitespace tokenizer. Here is the analyzer I am currently using:
{
"itemindex": {
"settings": {
"index": {
"uuid": "1HxasKSCSW2iRHf6pYfkWw",
"analysis": {
"analyzer": {
"my_analyzer": {
"type": "custom",
"filter": [
"lowercase",
"my_word_delimiter"
],
"tokenizer": "whitespace"
}
},
"filter": {
"my_word_delimiter": {
"type_table": "/ => ALPHANUM",
"preserve_original": "true",
"catenate_words": "true",
"type": "word_delimiter"
}
}
},
"number_of_replicas": "1",
"number_of_shards": "5",
"version": {
"created": "1000099"
}
}
}
}
}
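For reference, I generated the tokens with the 1.x-style Analyze API against that index, roughly like this (hostname and port are assumptions; the analyzer name matches the settings above):
curl -XGET 'http://localhost:9200/itemindex/_analyze?analyzer=my_analyzer&pretty' -d 'AA4VG90EP4DM1/32R-NSF52F001DX-S WITH DAMAN MANIFOLD 0281CS0011 SIEMENS SPEC 74/07104909/10 REV L WOMACK SYSTEMS'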
Here is the output from the analyze api for the above description:
{
"tokens": [
{
"token": "aa4vg90ep4dm1/32r-nsf52f001dx-s",
"start_offset": 0,
"end_offset": 31,
"type": "word",
"position": 1
},
{
"token": "aa",
"start_offset": 0,
"end_offset": 2,
"type": "word",
"position": 1
},
{
"token": "4",
"start_offset": 2,
"end_offset": 3,
"type": "word",
"position": 2
},
{
"token": "vg",
"start_offset": 3,
"end_offset": 5,
"type": "word",
"position": 3
},
{
"token": "90",
"start_offset": 5,
"end_offset": 7,
"type": "word",
"position": 4
},
{
"token": "ep",
"start_offset": 7,
"end_offset": 9,
"type": "word",
"position": 5
},
{
"token": "4",
"start_offset": 9,
"end_offset": 10,
"type": "word",
"position": 6
},
{
"token": "dm",
"start_offset": 10,
"end_offset": 12,
"type": "word",
"position": 7
},
{
"token": "1/32",
"start_offset": 12,
"end_offset": 16,
"type": "word",
"position": 8
},
{
"token": "r",
"start_offset": 16,
"end_offset": 17,
"type": "word",
"position": 9
},
{
"token": "nsf",
"start_offset": 18,
"end_offset": 21,
"type": "word",
"position": 10
},
{
"token": "rnsf",
"start_offset": 16,
"end_offset": 21,
"type": "word",
"position": 10
},
{
"token": "52",
"start_offset": 21,
"end_offset": 23,
"type": "word",
"position": 11
},
{
"token": "f",
"start_offset": 23,
"end_offset": 24,
"type": "word",
"position": 12
},
{
"token": "001",
"start_offset": 24,
"end_offset": 27,
"type": "word",
"position": 13
},
{
"token": "dx",
"start_offset": 27,
"end_offset": 29,
"type": "word",
"position": 14
},
{
"token": "s",
"start_offset": 30,
"end_offset": 31,
"type": "word",
"position": 15
},
{
"token": "dxs",
"start_offset": 27,
"end_offset": 31,
"type": "word",
"position": 15
},
{
"token": "with",
"start_offset": 32,
"end_offset": 36,
"type": "word",
"position": 16
},
{
"token": "daman",
"start_offset": 37,
"end_offset": 42,
"type": "word",
"position": 17
},
{
"token": "manifold",
"start_offset": 43,
"end_offset": 51,
"type": "word",
"position": 18
},
{
"token": "0281cs0011",
"start_offset": 52,
"end_offset": 62,
"type": "word",
"position": 19
},
{
"token": "0281",
"start_offset": 52,
"end_offset": 56,
"type": "word",
"position": 19
},
{
"token": "cs",
"start_offset": 56,
"end_offset": 58,
"type": "word",
"position": 20
},
{
"token": "0011",
"start_offset": 58,
"end_offset": 62,
"type": "word",
"position": 21
},
{
"token": "siemens",
"start_offset": 63,
"end_offset": 70,
"type": "word",
"position": 22
},
{
"token": "spec",
"start_offset": 71,
"end_offset": 75,
"type": "word",
"position": 23
},
{
"token": "74/07104909/10",
"start_offset": 76,
"end_offset": 90,
"type": "word",
"position": 24
},
{
"token": "rev",
"start_offset": 91,
"end_offset": 94,
"type": "word",
"position": 25
},
{
"token": "l",
"start_offset": 95,
"end_offset": 96,
"type": "word",
"position": 26
},
{
"token": "womack",
"start_offset": 98,
"end_offset": 104,
"type": "word",
"position": 27
},
{
"token": "systems",
"start_offset": 105,
"end_offset": 112,
"type": "word",
"position": 28
}
]
}
Finally, here is the NEST query I am using:
var results = client.Search<SearchResult>(s => s.Index("itemindex")
.Query(q => q
.QueryString(qs=> qs
.OnFields(f=> f.Description, f=> f.VendorPartNumber, f=> f.ItemNumber)
.Operator(Operator.or)
.Query(query + "*")))
.SortDescending("_score")
.Highlight(h => h
.OnFields(f => f
.OnField(e => e.Description)
.BoundaryCharacters(" ,")
.PreTags("<b>")
.PostTags("</b>")))
.From(start)
.Size(size));