Count total number of words of all documents pointing to specific fields - elasticsearch

Someone asked this question but no one seems to answer or tried to suggest possible ways to solve it: https://discuss.elastic.co/t/count-the-number-of-words-in-the-field-elastic-search-6-2/121373
Now, I'm trying to produce a report from Elasticsearch to count the number of WORDS / TOKENS from a specific field called title and content
Is there a proper aggregation for this?
For example, I have this query:
GET web/_search
{
"query":{
"bool":{
"must":[
{
"query_string":{
"fields":[
"title",
"content"
],
"query":"((\"Hello\") AND (\"World\")"
}
},
{
"range":{
"pub_date":{
"from":1569456000,
"to":1570060800
}
}
}
]
}
}
}
And for example, this query produced 23 DOCUMENTS, I want to make a response telling me how MANY words do those 23 documents contain based from the title and content fields?

I would leverage the token_count data type. In your index, you can add a sub-field of type token_count to your title and content fields, like this:
PUT web
{
"mappings": {
"properties": {
"title": {
"type": "text",
"fields": {
"length": {
"type": "token_count",
"analyzer": "standard"
}
}
},
"content": {
"type": "text",
"fields": {
"length": {
"type": "token_count",
"analyzer": "standard"
}
}
}
}
}
}
Then, in order to find out the number of tokens, you can simply run a sum aggregation on the .length sub-field, like this:
POST web/_search
{
"size": 0,
"aggs": {
"title_tokens": {
"sum": {
"field": "title.length"
}
},
"content_tokens": {
"sum": {
"field": "content.length"
}
}
}
}

I am using data type called token_count It will calculate and store the count of tokens for each text. This count value can be utilized to get the token count of fields
PUT index18
{
"mappings": {
"properties": {
"title": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
},
"length": {
"type": "token_count",
"analyzer": "standard"
}
}
},
"content": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
},
"length": {
"type": "token_count",
"analyzer": "standard"
}
}
}
}
}
}
Data:
"hits" : [
{
"_index" : "index18",
"_type" : "_doc",
"_id" : "edJPtW0BVHM68p7X-Wlu",
"_score" : 1.0,
"_source" : {
"title" : "Mayor Isko"
}
},
{
"_index" : "index18",
"_type" : "_doc",
"_id" : "etJQtW0BVHM68p7XGmmr",
"_score" : 1.0,
"_source" : {
"title" : "Isko"
}
}
]
Query
GET index18/_search
{
"query": {"match_all": {}},
"aggs": {
"WordCount": {
"sum": {
"field": "title.length"
}
}
}
}

Related

How to find all Nike products with "nikeeeee" keyword

I have an Elasticsearch db with 15 million products.
I want to write a query to find all Nike products when my user search "nikeeeee" or "nikeoff" or "bestnike" or "nikenike" or "nike-Nike" or some keywords like these.
When I used Fazzy query, The result returned was not relevant.
How can i handle it?
Thanks in advance
To find all Nike products when my user search "nikeeeee"
As mentioned by #rabbitbr you can use synonymns
Using synonyms multiple words can point to same token.
Below is working example of it
PUT <index-name>
{
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "synonym"
}
}
},
"settings": {
"index": {
"analysis": {
"analyzer": {
"synonym": {
"tokenizer": "standard",
"filter": [
"synonym"
]
}
},
"filter": {
"synonym": {
"type": "synonym",
"lenient": true,
"synonyms": [
"nikeeeee,nikeoff,bestnike,nikenike,nike-nike => nike"
]
}
}
}
}
}
}
POST <index-name>/_doc
{
"title":"adidas shoes"
}
GET <index-name>/_search
{
"query": {
"match": {
"title": "bestnike"
}
}
}
Result
"hits" : [
{
"_index" : "index50",
"_type" : "_doc",
"_id" : "ky9VF4QBpvliSuG-OTh-",
"_score" : 0.6931471,
"_source" : {
"title" : "nike shoes"
}
}
]

why elasticsearch keyword search not working?

i use NLog to write log message to Elasticsearch, the index structure is here:
"mappings": {
"logevent": {
"properties": {
"#timestamp": {
"type": "date"
},
"MachineName": {
"type": "text",
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
}
},
"level": {
"type": "text",
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
}
},
"message": {
"type": "text",
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
}
}
}
}
}
I was able to get results using a text search:
GET /webapi-2022.07.28/_search
{
"query": {
"match": {
"message": "ERROR"
}
}
}
result
"hits" : [
{
"_index" : "webapi-2022.07.28",
"_type" : "logevent",
"_id" : "IFhYQoIBRhF4cR9wr-ja",
"_score" : 4.931916,
"_source" : {
"#timestamp" : "2022-07-28T01:07:58.8822339Z",
"level" : "Error",
"message" : """2022-07-28 09:07:58.8822|ERROR|AppSrv.Filter.AccountAuthorizeAttribute|[KO17111808]-[172.10.2.200]-[ERROR]-"message"""",
"MachineName" : "WIN-EPISTFOBD41"
}
}
//.....
]
but when i use keyword, i get nothing:
GET /webapi-2022.07.28/_search
{
"query": {
"term": {
"message.keyword": "ERROR"
}
}
}
i tried term and match, the result is same.
this is happening due to message field not just containing ERROR but also having other string in the .keyword field, you need to use the text search only in your case, you can use the .keyword field only in case of the exact search.
If your message field contained only the ERROR string than only searching on your .keyword would produce result, you can test it yourself by indexing a sample document.

How to Query elasticsearch index with nested and non nested fields

I have an elastic search index with the following mapping:
PUT /student_detail
{
"mappings" : {
"properties" : {
"id" : { "type" : "long" },
"name" : { "type" : "text" },
"email" : { "type" : "text" },
"age" : { "type" : "text" },
"status" : { "type" : "text" },
"tests":{ "type" : "nested" }
}
}
}
Data stored is in form below:
{
"id": 123,
"name": "Schwarb",
"email": "abc#gmail.com",
"status": "current",
"age": 14,
"tests": [
{
"test_id": 587,
"test_score": 10
},
{
"test_id": 588,
"test_score": 6
}
]
}
I want to be able to query the students where name like '%warb%' AND email like '%gmail.com%' AND test with id 587 have score > 5 etc. The high level of what is needed can be put something like below, dont know what would be the actual query, apologize for this messy query below
GET developer_search/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"name": "abc"
}
},
{
"nested": {
"path": "tests",
"query": {
"bool": {
"must": [
{
"term": {
"tests.test_id": IN [587]
}
},
{
"term": {
"tests.test_score": >= some value
}
}
]
}
}
}
}
]
}
}
}
The query must be flexible so that we can enter dynamic test Ids and their respective score filters along with the fields out of nested fields like age, name, status
Something like that?
GET student_detail/_search
{
"query": {
"bool": {
"must": [
{
"wildcard": {
"name": {
"value": "*warb*"
}
}
},
{
"wildcard": {
"email": {
"value": "*gmail.com*"
}
}
},
{
"nested": {
"path": "tests",
"query": {
"bool": {
"must": [
{
"term": {
"tests.test_id": 587
}
},
{
"range": {
"tests.test_score": {
"gte": 5
}
}
}
]
}
},
"inner_hits": {}
}
}
]
}
}
}
Inner hits is what you are looking for.
You must make use of Ngram Tokenizer as wildcard search must not be used for performance reasons and I wouldn't recommend using it.
Change your mapping to the below where you can create your own Analyzer which I've done in the below mapping.
How elasticsearch (albiet lucene) indexes a statement is, first it breaks the statement or paragraph into words or tokens, then indexes these words in the inverted index for that particular field. This process is called Analysis and that this would only be applicable on text datatype.
So now you only get the documents if these tokens are available in inverted index.
By default, standard analyzer would be applied. What I've done is I've created my own analyzer and used Ngram Tokenizer which would be creating many more tokens than just simply words.
Default Analyzer on Life is beautiful would be life, is, beautiful.
However using Ngrams, the tokens for Life would be lif, ife & life
Mapping:
PUT student_detail
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 3,
"max_gram": 4,
"token_chars": [
"letter",
"digit"
]
}
}
}
},
"mappings" : {
"properties" : {
"id" : {
"type" : "long"
},
"name" : {
"type" : "text",
"analyzer": "my_analyzer",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"email" : {
"type" : "text",
"analyzer": "my_analyzer",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"age" : {
"type" : "text" <--- I am not sure why this is text. Change it to long or int. Would leave this to you
},
"status" : {
"type" : "text",
"analyzer": "my_analyzer",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"tests":{
"type" : "nested"
}
}
}
}
Note that in the above mapping I've created a sibling field in the form of keyword for name, email and status as below:
"name":{
"type":"text",
"analyzer":"my_analyzer",
"fields":{
"keyword":{
"type":"keyword"
}
}
}
Now your query could be as simple as below.
Query:
POST student_detail/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"name": "war" <---- Note this. This would even return documents having "Schwarb"
}
},
{
"match": {
"email": "gmail" <---- Note this
}
},
{
"nested": {
"path": "tests",
"query": {
"bool": {
"must": [
{
"term": {
"tests.test_id": 587
}
},
{
"range": {
"tests.test_score": {
"gte": 5
}
}
}
]
}
}
}
}
]
}
}
}
Note that for exact matches I would make use of Term Queries on keyword fields while for normal searches or LIKE in SQL I would make use of simple Match Queries on text Fields provided they make use of Ngram Tokenizer.
Also note that for >= and <= you would need to make use of Range Query.
Response:
{
"took" : 233,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 3.7260926,
"hits" : [
{
"_index" : "student_detail",
"_type" : "_doc",
"_id" : "1",
"_score" : 3.7260926,
"_source" : {
"id" : 123,
"name" : "Schwarb",
"email" : "abc#gmail.com",
"status" : "current",
"age" : 14,
"tests" : [
{
"test_id" : 587,
"test_score" : 10
},
{
"test_id" : 588,
"test_score" : 6
}
]
}
}
]
}
}
Note that I observe the document you've mentioned in your question, in my response when I run the query.
Please do read the links I've shared. It is vital that you understand the concepts. Hope this helps!

Elasticsearch - Can't search using suggestion field (“is not a completion suggest field”)

I'm completely new to elasticsearch and I'm trying to use elasticsearch completion suggester on an existing field called "identity.full_name", index = "search" and type = "person".
I followed the below index to change the mappings of the field.
1)
POST /search/_close
2)
POST search/person/_mapping
{
"person": {
"properties": {
"identity.full_name": {
"type": "text",
"fields":{
"suggest":{
"type":"completion"
}
}
}
}
}
}
3)
POST /search/_open
When I check the mappings at this point, using
GET search/_mapping/person/field/identity.full_name
I get the result,
{
"search": {
"mappings": {
"person": {
"identity.full_name": {
"full_name": "identity.full_name",
"mapping": {
"full_name": {
"type": "text",
"fields": {
"completion": {
"type": "completion",
"analyzer": "simple",
"preserve_separators": true,
"preserve_position_increments": true,
"max_input_length": 50
},
"keyword": {
"type": "keyword",
"ignore_above": 256
},
"suggest": {
"type": "completion",
"analyzer": "simple",
"preserve_separators": true,
"preserve_position_increments": true,
"max_input_length": 50
}
}
}
}
}
}
}
}
}
Which is suggesting that it has been updated to be a completion field.
However, when I'm querying to check if this works using,
GET search/person/_search
{
"suggest": {
"person-suggest" : {
"prefix" : "EMANNUEL",
"completion" : {
"field" : "identity.full_name"
}
}
}
}
It is giving me the error "Field [identity.full_name] is not a completion suggest field"
I'm not sure why I'm getting this error. Is there anything else I can try?
sample data:
{
"_index": "search",
"_type": "person",
"_id": "3106105149",
"_score": 1,
"_source": {
"identity": {
"id": "3106105149",
"first_name": "FLORENT",
"last_name": "TEBOUL",
"full_name": "FLORENT TEBOUL"
}
}
}
{
"_index": "search",
"_type": "person",
"_id": "125296353",
"_score": 1,
"_source": {
"identity": {
"id": "125296353",
"first_name": "CHRISTINA",
"last_name": "BHAN",
"full_name": "CHRISTINA K BHAN"
}
}
}
so when I do a GET based on prefix "CHRISTINA"
GET search/person/_search
{
"suggest": {
"person-suggest" : {
"prefix" : "CHRISTINA",
"completion" : {
"field" : "identity.full_name.suggest"
}
}
}
}
I'm getting all the results like a match_all query.
You should use it like
GET search/person/_search
{
"suggest": {
"person-suggest" : {
"prefix" : "EMANNUEL",
"completion" : {
"field" : "identity.full_name.suggest"
}
}
}
}
Mapping for GET search/_mapping/person/field/identity.full_name
{
"search" : {
"mappings" : {
"person" : {
"identity.full_name" : {
"full_name" : "identity.full_name",
"mapping" : {
"full_name" : {
"type" : "text",
"fields" : {
"suggest" : {
"type" : "completion",
"analyzer" : "simple",
"preserve_separators" : true,
"preserve_position_increments" : true,
"max_input_length" : 50
}
}
}
}
}
}
}
}
}

How Elasticsearch relevance score gets calculated?

I am using multi_match with phrase_prefix for full text search in Elasticsearch 5.5. ES query looks like
{
query: {
bool: {
must: {
multi_match: {
query: "butt",
type: "phrase_prefix",
fields: ["item.name", "item.keywords"],
max_expansions: 10
}
}
}
}
}
I am getting following response
[
{
"_index": "items_index",
"_type": "item",
"_id": "2",
"_score": 0.61426216,
"_source": {
"item": {
"keywords": "amul butter, milk, butter milk, flavoured",
"name": "Flavoured Butter"
}
}
},
{
"_index": "items_index",
"_type": "item",
"_id": "1",
"_score": 0.39063013,
"_source": {
"item": {
"keywords": "amul butter, milk, butter milk",
"name": "Butter Milk"
}
}
}
]
Mappings is as follows(I am using default mappings)
{
"items_index" : {
"mappings" : {
"parent_doc": {
...
"properties": {
"item" : {
"properties" : {
"keywords" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"name" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
}
}
}
}
}
How item with "name": "Flavoured Butter" getting higher score of 0.61426216 against the document with "name": "Butter Milk" and score 0.39063013?
I tried applying boost to "item.name" and removing "item.keywords" form search fields getting same results.
How scores in Elasticsearch works? Are above results correct in terms of relavance?
The scoring for phrase_prefix is similar to that of best_fields, meaning that score of a document is the score obtained from the best_field, which here is item.keywords.
So, item.name isn't adding to score
Refer: multi-match-types
You can use 2 multi_match queries to combine the score from keywords and name.
{
"query": {
"bool": {
"must": [{
"multi_match": {
"query": "butt",
"type": "phrase_prefix",
"fields": [
"item.keywords"
],
"max_expansions": 10
}
},{
"multi_match": {
"query": "butt",
"type": "phrase_prefix",
"fields": [
"item.name"
],
"max_expansions": 10
}
}]
}
}
}

Resources