Analyzers in ElasticSearch not working - elasticsearch

I am using ElasticSearch to store the Tweets I receive from the Twitter Streaming API. Before storing them I'd like to apply an English stemmer to the Tweet content. To do that I'm trying to use ElasticSearch analyzers, with no luck.
This is the current template I am using:
PUT _template/twitter
{
"template": "139*",
"settings" : {
"index":{
"analysis":{
"analyzer":{
"english":{
"type":"custom",
"tokenizer":"standard",
"filter":["lowercase", "en_stemmer", "stop_english", "asciifolding"]
}
},
"filter":{
"stop_english":{
"type":"stop",
"stopwords":["_english_"]
},
"en_stemmer" : {
"type" : "stemmer",
"name" : "english"
}
}
}
}
},
"mappings": {
"tweet": {
"_timestamp": {
"enabled": true,
"store": true,
"index": "analyzed"
},
"_index": {
"enabled": true,
"store": true,
"index": "analyzed"
},
"properties": {
"geo": {
"properties": {
"coordinates": {
"type": "geo_point"
}
}
},
"text": {
"type": "string",
"analyzer": "english"
}
}
}
}
}
When I start the Streaming and the index is created, all the mappings I've defined seem to apply correctly, but the text is stored as it comes from Twitter, completely raw. The index metadata shows:
"settings" : {
"index" : {
"uuid" : "xIOkEcoySAeZORr7pJeTNg",
"analysis" : {
"filter" : {
"en_stemmer" : {
"type" : "stemmer",
"name" : "english"
},
"stop_english" : {
"type" : "stop",
"stopwords" : [
"_english_"
]
}
},
"analyzer" : {
"english" : {
"type" : "custom",
"filter" : [
"lowercase",
"en_stemmer",
"stop_english",
"asciifolding"
],
"tokenizer" : "standard"
}
}
},
"number_of_replicas" : "1",
"number_of_shards" : "5",
"version" : {
"created" : "1010099"
}
}
},
"mappings" : {
"tweet" : {
[...]
"text" : {
"analyzer" : "english",
"type" : "string"
},
[...]
}
}
What am I doing wrong? The analyzer seems to be applied correctly, but nothing is happening :/
Thank you!
PS: This is the search query that made me realize the analyzer is not being applied:
curl -XGET 'http://localhost:9200/_all/_search?pretty' -d '{
"query": {
"filtered": {
"query": {
"bool": {
"should": [
{
"query_string": {
"query": "_index:1397574496990"
}
}
]
}
},
"filter": {
"bool": {
"must": [
{
"match_all": {}
},
{
"exists": {
"field": "geo.coordinates"
}
}
]
}
}
}
},
"fields": [
"geo.coordinates",
"text"
],
"size": 50000
}'
This should return the stemmed text as one of the fields, but the response is:
{
"took": 29,
"timed_out": false,
"_shards": {
"total": 47,
"successful": 47,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0.97402453,
"hits": [
{
"_index": "1397574496990",
"_type": "tweet",
"_id": "456086643423068161",
"_score": 0.97402453,
"fields": {
"geo.coordinates": [
-118.21122533,
33.79349318
],
"text": [
"Happy turtle Tuesday ! The week is slowly crawling to Wednesday good morning everyone 🌊🐢🐢🐢☀️#turtles… http://t.co/wAVmcxnf76"
]
}
},
{
"_index": "1397574496990",
"_type": "tweet",
"_id": "456086701451259904",
"_score": 0.97333175,
"fields": {
"geo.coordinates": [
-81.017636,
33.998741
],
"text": [
"Tuesday is Twins Day over here, apparently (it's a far too often occurrence) #tuesdaytwinsday… http://t.co/Umhtp6SoX6"
]
}
}
]
}
}
The text field is exactly the same as what came from Twitter (I'm using the streaming API). What I expected was a stemmed text field, since the analyzer is applied.

Analyzers don't change the stored data. No matter which analyzer you use, you get the original text back from _source and from stored fields. The analyzer only determines which tokens go into the inverted index at index time, and how the query text is tokenized at search time. So if you search for something like text:twin and find records containing the word Twins, you know the stemmer was applied.
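To make the distinction concrete, here is a minimal Python sketch of a toy index. It is not how Elasticsearch is implemented: toy_stem is a made-up stand-in for a real English stemmer, and ToyIndex is a hypothetical class used only for illustration. It shows the stored text coming back unchanged while matching happens on the analyzed tokens.

```python
import re

def toy_stem(token):
    # Crude stand-in for a real English stemmer: strip a trailing "s".
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

def analyze(text):
    # Lowercase, split on non-alphanumerics, then stem each token.
    return [toy_stem(t) for t in re.findall(r"[a-z0-9]+", text.lower())]

class ToyIndex:
    def __init__(self):
        self.source = {}    # doc id -> original text, returned untouched
        self.inverted = {}  # token -> set of doc ids, used for matching

    def index(self, doc_id, text):
        self.source[doc_id] = text
        for token in analyze(text):
            self.inverted.setdefault(token, set()).add(doc_id)

    def search(self, query):
        # The query text goes through the same analyzer as the documents.
        ids = set()
        for token in analyze(query):
            ids |= self.inverted.get(token, set())
        return [self.source[i] for i in sorted(ids)]

idx = ToyIndex()
idx.index(1, "Tuesday is Twins Day over here")
hits = idx.search("twin")
print(hits)                  # the original, unstemmed text
print(sorted(idx.inverted))  # the tokens that were actually indexed
```

The search for twin finds the tweet because the index holds the stemmed token, yet the text handed back is the raw tweet.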

Related

How to find all Nike products with "nikeeeee" keyword

I have an Elasticsearch db with 15 million products.
I want to write a query that finds all Nike products when a user searches for "nikeeeee", "nikeoff", "bestnike", "nikenike", "nike-Nike" or similar keywords.
When I used a fuzzy query, the results returned were not relevant.
How can I handle this?
Thanks in advance
To find all Nike products when my user search "nikeeeee"
As mentioned by @rabbitbr, you can use synonyms.
With synonyms, multiple words can point to the same token.
Below is a working example:
PUT <index-name>
{
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "synonym"
}
}
},
"settings": {
"index": {
"analysis": {
"analyzer": {
"synonym": {
"tokenizer": "standard",
"filter": [
"synonym"
]
}
},
"filter": {
"synonym": {
"type": "synonym",
"lenient": true,
"synonyms": [
"nikeeeee,nikeoff,bestnike,nikenike,nike-nike => nike"
]
}
}
}
}
}
}
POST <index-name>/_doc
{
"title":"nike shoes"
}
GET <index-name>/_search
{
"query": {
"match": {
"title": "bestnike"
}
}
}
Result
"hits" : [
{
"_index" : "index50",
"_type" : "_doc",
"_id" : "ky9VF4QBpvliSuG-OTh-",
"_score" : 0.6931471,
"_source" : {
"title" : "nike shoes"
}
}
]
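The idea behind the synonym filter can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the SYNONYMS table and analyze function are hypothetical names, and the regular expression is a rough approximation of the standard tokenizer.

```python
import re

# Every left-hand variant maps to the single token on the right, mirroring
# the rule "nikeeeee,nikeoff,bestnike,nikenike => nike" from the mapping.
SYNONYMS = {v: "nike" for v in ["nikeeeee", "nikeoff", "bestnike", "nikenike"]}

def analyze(text):
    # Split on non-word characters, then replace known variants.
    tokens = re.findall(r"\w+", text)
    return [SYNONYMS.get(t, t) for t in tokens]

query_tokens = analyze("bestnike")
doc_tokens = analyze("nike shoes")
print(query_tokens, doc_tokens)
# The query matches when it shares at least one token with the document.
print(bool(set(query_tokens) & set(doc_tokens)))
```

Because the same analyzer runs on the query and on the indexed title, both sides end up with the token nike. The drawback is that every variant has to be listed explicitly in the synonym rule.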

Strange behavior of range query in Elasticsearch

My question is pretty simple. I have an ES index which contains a field updated that is a UNIX timestamp. I only have test records (documents) in my index, all created today.
I have the following query, which works well and (rightfully) doesn't return any results when executed:
GET /test_index/_search
{
"size": 1,
"query": {
"bool": {
"must": [
{
"range": {
"updated": {
"lt": "159525360"
}
}
}
]
}
},
"sort": [
{
"updated": {
"order": "desc",
"mode": "avg"
}
}
]
}
So this is all OK. However, when I change the timestamp in my query to a lower number, I get multiple results! And all of these results contain much larger values in the updated field than 5000! Even more bafflingly, I only get results when the value in the query is in the range 1971 to 9999. Numbers like 1500 or 10000 behave correctly and I see no results. The strangely behaving query is below.
GET /test_index/_search
{
"size": 100,
"query": {
"bool": {
"must": [
{
"range": {
"updated": {
"lt": "5000"
}
}
}
]
}
},
"sort": [
{
"updated": {
"order": "desc",
"mode": "avg"
}
}
]
}
Btw, this is what a typical document stored in this index looks like:
{
"_index" : "test_index",
"_type" : "_doc",
"_id" : "V6LDyHMBAUKhWZ7lxRtb",
"_score" : null,
"_source" : {
"councilId" : 111,
"chargerId" : "15",
"unitId" : "a",
"connectorId" : "2",
"status" : 10,
"latitude" : 77.7,
"longitude" : 77.7,
"lastStatusChange" : 1596718920,
"updated" : 1596720720,
"dataType" : "recorded"
},
"sort" : [
1596720720
]
}
Here is a mapping of this index:
PUT /test_index/_mapping
{
"properties": {
"chargerId": { "type": "text"},
"unitId": { "type": "text"},
"connectorId": { "type": "text"},
"councilId": { "type": "integer"},
"status": {"type": "integer"},
"longitude" : {"type": "double"},
"latitude" : {"type": "double"},
"lastStatusChange" : {"type": "date"},
"updated": {"type": "date"}
}
}
Is there any explanation for this?
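The default format for a date field in ES is
strict_date_optional_time||epoch_millis. Since you haven't specified epoch_second, your timestamps were parsed incorrectly: they were treated as milliseconds since the epoch, which puts all your documents in January 1970. That also explains the odd range: a four-digit string such as "5000" most likely matches strict_date_optional_time first and is read as the year 5000, so every document is "less than" it, whereas 10000 is not a valid strict year and falls back to epoch_millis. You can verify how the values were parsed by running this script: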
GET test_index/_search
{
"script_fields": {
"updated_pretty": {
"script": {
"lang": "painless",
"source": """
LocalDateTime.ofInstant(
Instant.ofEpochMilli(doc['updated'].value.millis),
ZoneId.of('Europe/Vienna')
).format(DateTimeFormatter.ofPattern("dd/MM/yyyy HH:mm"))
"""
}
}
}
}
Quick fix: update your mapping as follows:
{
...
"updated":{
"type":"date",
"format":"epoch_second"
}
}
and reindex.
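The difference between the two interpretations is easy to check outside Elasticsearch. The following Python sketch takes the updated value from the sample document and reads it once as milliseconds and once as seconds since the epoch (UTC is assumed here for simplicity).

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
updated = 1596720720  # value of "updated" in the sample document

# How the value is read with the default epoch_millis format.
as_millis = EPOCH + timedelta(milliseconds=updated)
# How the value was meant to be read (epoch_second).
as_seconds = EPOCH + timedelta(seconds=updated)

print(as_millis.isoformat())   # a date in January 1970
print(as_seconds.isoformat())  # a date in August 2020
```

Read as milliseconds, the value lands a few weeks after the epoch, which is why comparisons against real-world timestamps behave unexpectedly.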

How to Query elasticsearch index with nested and non nested fields

I have an elastic search index with the following mapping:
PUT /student_detail
{
"mappings" : {
"properties" : {
"id" : { "type" : "long" },
"name" : { "type" : "text" },
"email" : { "type" : "text" },
"age" : { "type" : "text" },
"status" : { "type" : "text" },
"tests":{ "type" : "nested" }
}
}
}
Data stored is in form below:
{
"id": 123,
"name": "Schwarb",
"email": "abc#gmail.com",
"status": "current",
"age": 14,
"tests": [
{
"test_id": 587,
"test_score": 10
},
{
"test_id": 588,
"test_score": 6
}
]
}
I want to be able to query the students where name like '%warb%' AND email like '%gmail.com%' AND the test with id 587 has a score > 5, etc. At a high level, what I need is something like the query below. I don't know what the actual query would be, so apologies for the messy pseudo-query:
GET developer_search/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"name": "abc"
}
},
{
"nested": {
"path": "tests",
"query": {
"bool": {
"must": [
{
"term": {
"tests.test_id": IN [587]
}
},
{
"term": {
"tests.test_score": >= some value
}
}
]
}
}
}
}
]
}
}
}
The query must be flexible, so that we can pass in dynamic test ids and their respective score filters, along with filters on non-nested fields like age, name, and status.
Something like this?
GET student_detail/_search
{
"query": {
"bool": {
"must": [
{
"wildcard": {
"name": {
"value": "*warb*"
}
}
},
{
"wildcard": {
"email": {
"value": "*gmail.com*"
}
}
},
{
"nested": {
"path": "tests",
"query": {
"bool": {
"must": [
{
"term": {
"tests.test_id": 587
}
},
{
"range": {
"tests.test_score": {
"gte": 5
}
}
}
]
}
},
"inner_hits": {}
}
}
]
}
}
}
Inner hits is what you are looking for.
I would recommend an Ngram Tokenizer instead, because wildcard queries with a leading wildcard are expensive and I wouldn't recommend them.
Change your mapping to the one below, which defines a custom analyzer.
The way Elasticsearch (or rather Lucene) indexes a statement is that it first breaks the statement or paragraph into words, or tokens, and then adds these tokens to the inverted index for that particular field. This process is called analysis, and it only applies to the text datatype.
A document is returned only if the tokens produced from your query are present in the inverted index.
By default, the standard analyzer is applied. What I've done is create my own analyzer that uses the Ngram Tokenizer, which produces many more tokens than just the plain words.
With the default analyzer, Life is beautiful becomes life, is, beautiful.
With ngrams (min_gram 3, max_gram 4), the tokens for life would be lif, life and ife.
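To see which tokens such a tokenizer produces, here is a small Python sketch. The ngrams function is a hypothetical helper that only mimics the min_gram/max_gram behaviour for a single word. It ignores token_chars and everything else the real tokenizer does.

```python
def ngrams(word, min_gram=3, max_gram=4):
    # For each start position, emit every substring whose length lies
    # between min_gram and max_gram and that still fits in the word.
    grams = []
    for start in range(len(word)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(word):
                grams.append(word[start:start + size])
    return grams

life_tokens = ngrams("life")
name_tokens = ngrams("Schwarb")
print(life_tokens)
print(name_tokens)
# A query for "war" can match because "war" is one of the indexed tokens.
print("war" in name_tokens)
```

This is also why the index grows: a seven-letter name already yields nine tokens instead of one.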
Mapping:
PUT student_detail
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 3,
"max_gram": 4,
"token_chars": [
"letter",
"digit"
]
}
}
}
},
"mappings" : {
"properties" : {
"id" : {
"type" : "long"
},
"name" : {
"type" : "text",
"analyzer": "my_analyzer",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"email" : {
"type" : "text",
"analyzer": "my_analyzer",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"age" : {
"type" : "text" <--- I am not sure why this is text. Change it to long or int. Would leave this to you
},
"status" : {
"type" : "text",
"analyzer": "my_analyzer",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"tests":{
"type" : "nested"
}
}
}
}
Note that in the above mapping I've created a sibling field in the form of keyword for name, email and status as below:
"name":{
"type":"text",
"analyzer":"my_analyzer",
"fields":{
"keyword":{
"type":"keyword"
}
}
}
Now your query could be as simple as below.
Query:
POST student_detail/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"name": "war" <---- Note this. This would even return documents having "Schwarb"
}
},
{
"match": {
"email": "gmail" <---- Note this
}
},
{
"nested": {
"path": "tests",
"query": {
"bool": {
"must": [
{
"term": {
"tests.test_id": 587
}
},
{
"range": {
"tests.test_score": {
"gte": 5
}
}
}
]
}
}
}
}
]
}
}
}
Note that for exact matches I would use Term Queries on the keyword fields, while for normal searches (LIKE in SQL) I would use simple Match Queries on the text fields, provided they use the Ngram Tokenizer.
Also note that for >= and <= you need a Range Query.
Response:
{
"took" : 233,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 3.7260926,
"hits" : [
{
"_index" : "student_detail",
"_type" : "_doc",
"_id" : "1",
"_score" : 3.7260926,
"_source" : {
"id" : 123,
"name" : "Schwarb",
"email" : "abc#gmail.com",
"status" : "current",
"age" : 14,
"tests" : [
{
"test_id" : 587,
"test_score" : 10
},
{
"test_id" : 588,
"test_score" : 6
}
]
}
}
]
}
}
Note that the document from your question shows up in my response when I run the query.
Please do read the links I've shared. It is vital that you understand the concepts. Hope this helps!
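Since the question asks for dynamic test ids with their own score filters, one possible approach is to build the request body in code. The sketch below is an assumption about how that could look: build_query and its parameters are hypothetical names, and it only produces the JSON body without sending it to a cluster. It emits one nested clause per test so that the id and its score threshold are checked against the same nested object.

```python
import json

def build_query(name=None, email=None, test_filters=None):
    # test_filters maps a test id to the minimum score required for it.
    must = []
    if name:
        must.append({"match": {"name": name}})
    if email:
        must.append({"match": {"email": email}})
    for test_id, min_score in (test_filters or {}).items():
        # One nested clause per test keeps id and score on the same object.
        must.append({
            "nested": {
                "path": "tests",
                "query": {"bool": {"must": [
                    {"term": {"tests.test_id": test_id}},
                    {"range": {"tests.test_score": {"gte": min_score}}},
                ]}},
            }
        })
    return {"query": {"bool": {"must": must}}}

body = build_query(name="war", email="gmail", test_filters={587: 5, 588: 6})
print(json.dumps(body, indent=2))
```

Putting both test ids into a single nested clause would require one nested object to have both ids at once, which can never match. That is why each test gets its own clause.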

How Elasticsearch relevance score gets calculated?

I am using multi_match with phrase_prefix for full text search in Elasticsearch 5.5. ES query looks like
{
query: {
bool: {
must: {
multi_match: {
query: "butt",
type: "phrase_prefix",
fields: ["item.name", "item.keywords"],
max_expansions: 10
}
}
}
}
}
I am getting following response
[
{
"_index": "items_index",
"_type": "item",
"_id": "2",
"_score": 0.61426216,
"_source": {
"item": {
"keywords": "amul butter, milk, butter milk, flavoured",
"name": "Flavoured Butter"
}
}
},
{
"_index": "items_index",
"_type": "item",
"_id": "1",
"_score": 0.39063013,
"_source": {
"item": {
"keywords": "amul butter, milk, butter milk",
"name": "Butter Milk"
}
}
}
]
The mapping is as follows (I am using the default mappings):
{
"items_index" : {
"mappings" : {
"parent_doc": {
...
"properties": {
"item" : {
"properties" : {
"keywords" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"name" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
}
}
}
}
}
Why does the item with "name": "Flavoured Butter" get the higher score of 0.61426216, compared to the document with "name": "Butter Milk" and a score of 0.39063013?
I tried applying a boost to "item.name" and removing "item.keywords" from the search fields, and got the same results.
How do scores in Elasticsearch work? Are the above results correct in terms of relevance?
Scoring for phrase_prefix works like best_fields: the score of a document is the score of its best-matching field, which here is item.keywords.
So item.name isn't adding to the score.
Refer: multi-match-types
You can use two multi_match queries to combine the scores from keywords and name:
{
"query": {
"bool": {
"must": [{
"multi_match": {
"query": "butt",
"type": "phrase_prefix",
"fields": [
"item.keywords"
],
"max_expansions": 10
}
},{
"multi_match": {
"query": "butt",
"type": "phrase_prefix",
"fields": [
"item.name"
],
"max_expansions": 10
}
}]
}
}
}
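The difference between the two ways of combining per-field scores can be sketched in Python. The numbers below are made-up per-field scores, not values computed by Elasticsearch, and the function names are hypothetical. The formulas follow the documented behaviour: best_fields takes the best field plus tie_breaker times the others, while clauses under bool must add up.

```python
def best_fields_score(field_scores, tie_breaker=0.0):
    # Best field wins; the other fields contribute only via tie_breaker.
    best = max(field_scores)
    return best + tie_breaker * (sum(field_scores) - best)

def bool_must_score(clause_scores):
    # Each matching clause of a bool query adds its score to the total.
    return sum(clause_scores)

# Hypothetical scores for item.keywords and item.name in one document.
scores = [0.6, 0.3]

single = best_fields_score(scores)
with_tie = best_fields_score(scores, tie_breaker=0.3)
combined = bool_must_score(scores)
print(single, with_tie, combined)
```

With the default tie_breaker of 0.0 the weaker field is ignored entirely, which matches the observation that item.name does not influence the result. Setting tie_breaker on the original query would be an alternative to splitting it into two queries.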

Wrong indexation elasticsearch using the analyser

I did a pretty simple test. I built a student index with a type, then defined a mapping:
POST student
{
"mappings" : {
"ing3" : {
"properties" : {
"quote": {
"type": "string",
"analyzer": "english"
}
}
}
}
}
After that I add 3 students to this index:
POST /student/ing3/1
{
"name": "Smith",
"first_name" : "John",
"quote" : "Learning is so cool!!"
}
POST /student/ing3/2
{
"name": "Roosevelt",
"first_name" : "Franklin",
"quote" : "I learn everyday"
}
POST /student/ing3/3
{
"name": "Black",
"first_name" : "Mike",
"quote" : "I learned a lot at school"
}
At this point I thought the english analyzer would tokenise all the words in my quotes, so if I run a search like:
GET /student/ing3/_search
{
"query" : {
"term" : { "quote" : "learn" }
}
}
I would get all the documents as a result, since the analyzer treats "learn, learning, learned" as equal, and I was right. But when I try this request:
GET /student/ing3/_search
{
"query" : {
"term" : { "quote" : "learned" }
}
}
I get zero hits, while in my opinion I should get at least the 3rd document. I assumed Elasticsearch would also index learned and learning, not only learn. Am I wrong? Is my request wrong?
If you check:
curl -XGET 'http://localhost:9200/student/_analyze?field=quote' -d 'I learned a lot at school'
you will see that your sentence is analyzed as:
{
"tokens":[
{
"token":"i",
"start_offset":0,
"end_offset":1,
"type":"<ALPHANUM>",
"position":0
},
{
"token":"learn",
"start_offset":2,
"end_offset":9,
"type":"<ALPHANUM>",
"position":1
},
{
"token":"lot",
"start_offset":12,
"end_offset":15,
"type":"<ALPHANUM>",
"position":3
},
{
"token":"school",
"start_offset":19,
"end_offset":25,
"type":"<ALPHANUM>",
"position":5
}
]
}
So the english analyzer removes punctuation and stop words and reduces words to their root form. A term query does not analyze the search text, so it looks for the exact token learned, which is not in the index.
https://www.elastic.co/guide/en/elasticsearch/guide/current/using-language-analyzers.html
You can use a match query, which also analyzes your search text and will therefore match:
GET /student/ing3/_search
{
"query" : {
"match" : { "quote" : "learned" }
}
}
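The difference between term and match can be illustrated with a toy index in Python. This is a sketch under simplifying assumptions: toy_stem is a crude suffix stripper rather than the real English stemmer, stop words are not removed, and term_query and match_query are hypothetical helpers that only imitate the lookup behaviour.

```python
import re

def toy_stem(token):
    # Crude stand-in for an English stemmer: strip "ing" or "ed".
    for suffix in ("ing", "ed"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[:-len(suffix)]
    return token

def analyze(text):
    return [toy_stem(t) for t in re.findall(r"[a-z]+", text.lower())]

docs = {
    1: "Learning is so cool!!",
    2: "I learn everyday",
    3: "I learned a lot at school",
}

# Build the inverted index from the analyzed quotes.
inverted = {}
for doc_id, quote in docs.items():
    for token in analyze(quote):
        inverted.setdefault(token, set()).add(doc_id)

def term_query(value):
    # No analysis: the value is looked up exactly as given.
    return sorted(inverted.get(value, set()))

def match_query(text):
    # The text is analyzed first, then each token is looked up.
    ids = set()
    for token in analyze(text):
        ids |= inverted.get(token, set())
    return sorted(ids)

print(term_query("learn"))
print(term_query("learned"))
print(match_query("learned"))
```

The term lookup for learned comes back empty because only the stem learn was indexed, while the match lookup stems the input first and finds all three documents.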
There is another way. You can stem the terms (the english analyzer does have a stemmer) and also keep the original terms, by using a keyword_repeat token filter followed by a unique token filter with "only_on_same_position": true to remove unnecessary duplicates after stemming:
PUT student
{
"settings": {
"analysis": {
"analyzer": {
"myAnalyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"english_possessive_stemmer",
"lowercase",
"english_stop",
"keyword_repeat",
"english_stemmer",
"unique_stem"
]
}
},
"filter": {
"unique_stem": {
"type": "unique",
"only_on_same_position": true
},
"english_stop": {
"type": "stop",
"stopwords": "_english_"
},
"english_stemmer": {
"type": "stemmer",
"language": "english"
},
"english_possessive_stemmer": {
"type": "stemmer",
"language": "possessive_english"
}
}
}
},
"mappings": {
"ing3": {
"properties": {
"quote": {
"type": "string",
"analyzer": "myAnalyzer"
}
}
}
}
}
In this case the term query will work as well. If you look at the terms that are actually being indexed:
GET /student/_search
{
"fielddata_fields": ["quote"]
}
it will be clear why now it matches:
"hits": [
{
"_index": "student",
"_type": "ing3",
"_id": "2",
"_score": 1,
"_source": {
"name": "Roosevelt",
"first_name": "Franklin",
"quote": "I learn everyday"
},
"fields": {
"quote": [
"everydai",
"everyday",
"i",
"learn"
]
}
},
{
"_index": "student",
"_type": "ing3",
"_id": "1",
"_score": 1,
"_source": {
"name": "Smith",
"first_name": "John",
"quote": "Learning is so cool!!"
},
"fields": {
"quote": [
"cool",
"learn",
"learning",
"so"
]
}
},
{
"_index": "student",
"_type": "ing3",
"_id": "3",
"_score": 1,
"_source": {
"name": "Black",
"first_name": "Mike",
"quote": "I learned a lot at school"
},
"fields": {
"quote": [
"i",
"learn",
"learned",
"lot",
"school"
]
}
}
]
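The effect of keyword_repeat followed by unique can be mimicked in a few lines of Python. This is a rough sketch, not the Lucene implementation: toy_stem is a made-up suffix stripper, stop words and the possessive stemmer are left out, and analyze_keep_original is a hypothetical name.

```python
import re

def toy_stem(token):
    # Crude stand-in for an English stemmer: strip "ing" or "ed".
    for suffix in ("ing", "ed"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[:-len(suffix)]
    return token

def analyze_keep_original(text):
    tokens = []
    for position, raw in enumerate(re.findall(r"[a-z]+", text.lower())):
        # keyword_repeat: keep the original next to its stemmed copy,
        # both at the same position.
        candidates = [raw, toy_stem(raw)]
        # unique with only_on_same_position: drop the copy when stemming
        # did not change the token.
        kept = []
        for candidate in candidates:
            if candidate not in kept:
                kept.append(candidate)
        tokens.extend((position, c) for c in kept)
    return tokens

learned_tokens = analyze_keep_original("I learned")
learn_tokens = analyze_keep_original("I learn")
print(learned_tokens)
print(learn_tokens)
```

Both learned and learn end up in the index for the third document, so an unanalyzed term query for either form finds it, at the cost of a somewhat larger index.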
