Elasticsearch - Count of matches per document

I'm using this query to search a field for occurrences of phrases.
"query": {
"match_phrase": {
"content": "my test phrase"
}
}
I need to calculate how many matches occurred for each phrase per document (if this is even possible?).
I've considered aggregations, but I don't think they meet the requirement: they would give me the number of matches over the whole index, not per document.
Thanks.

This can be achieved with script fields and a Painless script.
You can count the number of occurrences per field and add them up for the document.
Example:
## Here's my test index with some sample values
POST t1/doc/1 <-- this has one occurrence
{
"content" : "my test phrase"
}
POST t1/doc/2 <-- this document has 5 occurrences
{
"content": "my test phrase ",
"content1" : "this is my test phrase 1",
"content2" : "this is my test phrase 2",
"content3" : "this is my test phrase 3",
"content4" : "this is my test phrase 4"
}
POST t1/doc/3
{
"content" : "my test new phrase"
}
Now, using the script, I can count the phrase matches for each field. I'm counting the phrase at most once per field, but you can modify the script to count multiple matches per field (see the sketch after the response below).
Obviously, the drawback here is that you need to mention each and every field of the document in the script, unless there's a way to loop through the doc fields that I'm not aware of.
POST t1/_search
{
"script_fields": {
"phrase_Count": {
"script": {
"lang": "painless",
"source": """
int count = 0;
if(doc['content.keyword'].size() > 0 && doc['content.keyword'].value.indexOf(params.phrase)!=-1) count++;
if(doc['content1.keyword'].size() > 0 && doc['content1.keyword'].value.indexOf(params.phrase)!=-1) count++;
if(doc['content2.keyword'].size() > 0 && doc['content2.keyword'].value.indexOf(params.phrase)!=-1) count++;
if(doc['content3.keyword'].size() > 0 && doc['content3.keyword'].value.indexOf(params.phrase)!=-1) count++;
if(doc['content4.keyword'].size() > 0 && doc['content4.keyword'].value.indexOf(params.phrase)!=-1) count++;
return count;
""",
"params": {
"phrase": "my test phrase"
}
}
}
}
}
This gives me the phrase count per document as a scripted field:
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 3,
"max_score" : 1.0,
"hits" : [
{
"_index" : "t1",
"_type" : "doc",
"_id" : "2",
"_score" : 1.0,
"fields" : {
"phrase_Count" : [
5 <--- count of occurrences of the phrase in the document
]
}
},
{
"_index" : "t1",
"_type" : "doc",
"_id" : "1",
"_score" : 1.0,
"fields" : {
"phrase_Count" : [
1
]
}
},
{
"_index" : "t1",
"_type" : "doc",
"_id" : "3",
"_score" : 1.0,
"fields" : {
"phrase_Count" : [
0
]
}
}
]
}
}
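As mentioned above, the script can be modified to count repeated occurrences of the phrase within a single field. A sketch of that variant (untested; the params.fields list of keyword field names is an assumption matching the sample index above):
POST t1/_search
{
  "script_fields": {
    "phrase_Count": {
      "script": {
        "lang": "painless",
        "source": """
        // Count every non-overlapping occurrence of the phrase in one field value
        int countIn(String text, String phrase) {
          int count = 0;
          int idx = text.indexOf(phrase);
          while (idx != -1) {
            count++;
            idx = text.indexOf(phrase, idx + phrase.length());
          }
          return count;
        }
        int total = 0;
        // Loop over the field names passed in as a parameter instead of
        // hard-coding one if-statement per field
        for (def field : params.fields) {
          if (doc.containsKey(field) && doc[field].size() > 0) {
            total += countIn(doc[field].value, params.phrase);
          }
        }
        return total;
        """,
        "params": {
          "phrase": "my test phrase",
          "fields": ["content.keyword", "content1.keyword", "content2.keyword", "content3.keyword", "content4.keyword"]
        }
      }
    }
  }
}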

You can also use the Term Vectors API to achieve this functionality; it reports per-term statistics for a single document, including how often each term occurs in it (term_freq).
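For example, something like this should return the term statistics for document 1 (a sketch, assuming the same t1 index as above; note that term vectors work on analyzed terms, so a multi-word phrase only shows up as one countable term if your analyzer emits it as a single token, e.g. via a shingle filter):
GET t1/doc/1/_termvectors
{
  "fields": ["content"],
  "positions": false,
  "offsets": false,
  "term_statistics": false
}
Each term in the response carries a term_freq, i.e. how often it occurs in that document's field.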

Related

Elasticsearch: how to find a document by number in logs

I have an error in Kibana:
"The length [2658823] of field [message] in doc[235892]/index[mylog-2023.02.10] exceeds the [index.highlight.max_analyzed_offset] limit [1000000]. To avoid this error, set the query parameter [max_analyzed_offset] to a value less than index setting [1000000] and this will tolerate long field values by truncating them."
I know how to deal with it (change "index.highlight.max_analyzed_offset" for the index, or set the query parameter), but I want to find the document with the long field and examine it.
If I try to find it by id, I get this:
Query:
GET mylog-2023.02.10/_search
{
"query": {
"terms": {
"_id": [ "235892" ]
}
}
}
Response:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 0,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
}
}
Query:
GET mylog-2023.02.10/_doc/235892
Response:
{
  "_index" : "mylog-2023.02.10",
  "_type" : "_doc",
  "_id" : "235892",
  "found" : false
}
Maybe this number (doc[235892]) is not the document id? How can I find this document?
Try using the IDs query:
GET /_search
{
"query": {
"ids" : {
"values" : ["1", "4", "100"]
}
}
}

Convert two repeated values in array into a string

I have some old documents where a field has an array of two repeated values, something like this:
"task" : [
"first_task",
"first_task"
],
I'm trying to convert this array into a single string, since both values are the same. I've seen a similar solution (Convert array with 2 equal values to single value), but in my case the problem can't be fixed through Logstash, because it only affects old documents that are already stored.
I was thinking to do something like this:
POST _ingest/pipeline/_simulate
{
"pipeline": {
"processors": [
{
"script": {
"description": "Change task field from array to first element of this one",
"lang": "painless",
"source": """
if (ctx['task'][0] == ctx['task'][1]) {
ctx['task'] = ctx['task'][0];
}
"""
}
}
]
},
"docs": [
{
"_index" : "tasks",
"_type" : "_doc",
"_id" : "1",
"_score" : 1.0,
"_source" : {
"#timestamp" : "2022-05-03T07:33:44.652Z",
"task" : ["first_task", "first_task"]
}
}
]
}
The result document is the following:
{
"docs" : [
{
"doc" : {
"_index" : "tasks",
"_type" : "_doc",
"_id" : "1",
"_source" : {
"#timestamp" : "2022-05-03T07:33:44.652Z",
"task" : "first_task"
},
"_ingest" : {
"timestamp" : "2022-05-11T09:08:48.150815183Z"
}
}
}
]
}
We can see the task field is reassigned and now holds the first element of the array as its value.
Is there a way to manipulate the actual data in Elasticsearch and convert all documents with this characteristic using DSL queries?
Thanks.
You can achieve this with the _update_by_query endpoint. Here is an example:
POST tasks/_update_by_query
{
"script": {
"source": """
if (ctx._source['task'][0] == ctx._source['task'][1]) {
ctx._source['task'] = ctx._source['task'][0];
}
""",
"lang": "painless"
},
"query": {
"match_all": {}
}
}
You can keep the match_all query (or omit the query entirely) to update every document, or you can restrict the update by changing the query conditions.
Keep in mind that running a script to update all documents in the index may cause performance issues while the update process is running.
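As a defensive variant (a sketch, untested), you can make the script skip documents whose task field is already a string, so re-running the update is harmless, and pre-filter to documents that have the field at all:
POST tasks/_update_by_query
{
  "script": {
    "lang": "painless",
    "source": """
    // Only rewrite when task is a two-element array of equal values;
    // otherwise mark the document as a no-op so it is not re-indexed.
    if (ctx._source['task'] instanceof List
        && ctx._source['task'].size() == 2
        && ctx._source['task'][0] == ctx._source['task'][1]) {
      ctx._source['task'] = ctx._source['task'][0];
    } else {
      ctx.op = 'noop';
    }
    """
  },
  "query": {
    "exists": { "field": "task" }
  }
}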

QuickSight or Elasticsearch - Column-wise aggregation

Is this possible to do in QuickSight or Elasticsearch? I have tried calculated fields in QuickSight and runtime scripts in Elasticsearch, but I'm not sure how to do it, or whether what I'm expecting is even possible in these tools.
I'm trying a simple date difference between rows based on their action: "time taken for 'creating a post' after a user registered".
(Input data and expected output were attached as screenshots.)
It is possible using a scripted metric aggregation.
Data
"hits" : [
{
"_index" : "index121",
"_type" : "_doc",
"_id" : "aqJ3HnoBF6_U07qsNY-s",
"_score" : 1.0,
"_source" : {
"user" : "Jen",
"activity" : "Logged In",
"activity_Time" : "2020-01-08"
}
},
{
"_index" : "index121",
"_type" : "_doc",
"_id" : "a6J3HnoBF6_U07qsXY_8",
"_score" : 1.0,
"_source" : {
"user" : "Jen",
"activity" : "Created a post",
"activity_Time" : "2020-05-08"
}
},
{
"_index" : "index121",
"_type" : "_doc",
"_id" : "bKJ3HnoBF6_U07qsk4-0",
"_score" : 1.0,
"_source" : {
"user" : "Mark",
"activity" : "Logged In",
"activity_Time" : "2020-01-03"
}
},
{
"_index" : "index121",
"_type" : "_doc",
"_id" : "baJ3HnoBF6_U07qsu48g",
"_score" : 1.0,
"_source" : {
"user" : "Mark",
"activity" : "Created a post",
"activity_Time" : "2020-01-08"
}
}
]
Query
{
"size": 0,
"aggs": {
"user": {
"terms": {
"field": "user.keyword",
"size": 10000
},
"aggs": {
"distinct_sum_feedback": {
"scripted_metric": {
"init_script": "state.docs = []",
"map_script": """ Map span = [
'timestamp':doc['activity_Time'],
'activity':doc['activity.keyword'].value
];
state.docs.add(span)
""",
"combine_script": "return state.docs;",
"reduce_script": """
def all_docs = [];
for (s in states)
{
for (span in s) {
all_docs.add(span);
}
}
all_docs.sort((HashMap o1, HashMap o2)->o1['timestamp'].getValue().toInstant().toEpochMilli().compareTo(o2['timestamp'].getValue().toInstant().toEpochMilli()));
Hashtable result= new Hashtable();
boolean found = false;
JodaCompatibleZonedDateTime loggedIn;
for (s in all_docs)
{
if(s.activity =='Logged In')
{
loggedIn=s.timestamp.getValue();
found= true;
}
if(s.activity =='Created a post' && found==true)
{
found=false;
def dt=loggedIn.getYear()+ '-' + loggedIn.getMonth() + '-' + loggedIn.getDayOfMonth();
def diff= s.timestamp.getValue().toInstant().toEpochMilli() - loggedIn.toInstant().toEpochMilli();
if(result.get(dt) == null)
{
result.put(dt, diff / 1000 / 60 / 60 / 24 )
}
}
}
return result;
"""
}
}
}
}
}
}
Result
"user" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Jen",
"doc_count" : 2,
"distinct_sum_feedback" : {
"value" : {
"2020-JANUARY-8" : 121
}
}
},
{
"key" : "Mark",
"doc_count" : 2,
"distinct_sum_feedback" : {
"value" : {
"2020-JANUARY-3" : 5
}
}
}
]
}
Explanation
init_script
Executed prior to any collection of documents; allows the aggregation to set up any initial state. Here it declares the list state.docs.
map_script
Executed once per collected document. It adds a map with the document's activity and timestamp to state.docs.
combine_script
Executed once on each shard after document collection is complete. It returns the shard's collection of maps.
reduce_script
Executed once on the coordinating node after all shards have returned their results. It merges the per-shard collections into a single list and sorts it by timestamp, then walks the sorted list, pairing each "Logged In" entry with the next "Created a post" entry and recording the difference between the two timestamps (in days) under the login date.

Elasticsearch sort results from several indexes so that one index has priority

I have 6 websites, let's call them A, B, C, D, E & M. M is the master website, because from it you can search the contents of the others; this I've done easily by putting all the indexes, separated by commas, in the search query.
However, I have a new requirement now: from every website you can search all websites (easy to do, apply the solution from M to all), BUT priority is given to results from the current website.
So if I'm searching from C, the first results should be from C, and then from the others based on score.
Now, how do I give results from one index priority over the rest?
A boosting query serves this purpose well:
Sample data
POST /_bulk
{"index":{"_index":"a"}}
{"message":"First website"}
{"index":{"_index":"b"}}
{"message":"Second website"}
{"index":{"_index":"c"}}
{"message":"Third website"}
{"index":{"_index":"d"}}
{"message":"Something irrelevant"}
Query
POST /a,b,c,d/_search
{
"query": {
"boosting": {
"positive": {
"match": {
"message": "website"
}
},
"negative": {
"terms": {
"_index": ["b", "c", "d"]
}
},
"negative_boost": 0.2
}
}
}
Response
{
...
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : 0.2876821,
"hits" : [
{
"_index" : "a",
"_type" : "_doc",
"_id" : "sx-DkWsBHWmGEbsYwViS",
"_score" : 0.2876821,
"_source" : {
"message" : "First website"
}
},
{
"_index" : "b",
"_type" : "_doc",
"_id" : "tB-DkWsBHWmGEbsYwViS",
"_score" : 0.05753642,
"_source" : {
"message" : "Second website"
}
},
{
"_index" : "c",
"_type" : "_doc",
"_id" : "tR-DkWsBHWmGEbsYwViS",
"_score" : 0.05753642,
"_source" : {
"message" : "Third website"
}
}
]
}
}
Notes
The smaller you make the negative_boost, the more likely it is that results from the "active index" will win out over the other indices
If you set the negative_boost to 0, you will guarantee that the "active site" results sort first, but you will discard all scores for all the other sites, so the remaining sort will be arbitrary.
I reckon something like negative_boost: 0.1, which is an order-of-magnitude adjustment on relevance, should get you what you're looking for.
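An alternative worth knowing (not part of the original answer) is the indices_boost search option, which raises the scores of hits from chosen indices rather than demoting the others. A minimal sketch for a search issued from site C:
POST /a,b,c,d/_search
{
  "indices_boost": [
    { "c": 1.5 }
  ],
  "query": {
    "match": {
      "message": "website"
    }
  }
}
Unlike a near-zero negative_boost, this preserves relative relevance within every index, but it does not guarantee that the boosted index sorts strictly first.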

Elasticsearch Top 10 Most Frequent Values In Array Across All Records

I have an index "test"; the document structure is shown below. Each document has an array of "tags". I am not able to figure out how to query this index to get the top 10 most frequently occurring tags.
Also, what best practices should one follow if there are more than 2 million docs in this index?
{
"_index" : "test",
"_type" : "data",
"_id" : "1412879673545024927_1373991666",
"_score" : 1.0,
"_source" : {
"instagramuserid" : "1373991666",
"likes_count" : 163,
"#timestamp" : "2017-06-08T08:52:41.803Z",
"post" : {
"created_time" : "1482648403",
"comments" : {
"count" : 9
},
"user_has_liked" : true,
"link" : "https://www.instagram.com/p/BObjpPMBWWf/",
"caption" : {
"created_time" : "1482648403",
"from" : {
"full_name" : "PARAMSahib ™",
"profile_picture" : "https://scontent.cdninstagram.com/t51.2885-19/s150x150/12750236_1692144537739696_350427084_a.jpg",
"id" : "1373991666",
"username" : "parambanana"
},
"id" : "17845953787172829",
"text" : "This feature talks about how to work pastels .\n\nDull gold pullover + saffron khadi kurta + baby pink pants + Deep purple patka and white sneakers - Perfect colours for a Happy sunday christmas morning . \n#paramsahib #men #menswear #mensfashion #mensfashionblog #mensfashionblogger #menswearofficial #menstyle #fashion #fashionfashion #fashionblog #blog #blogger #designer #fashiondesigner #streetstyle #streetfashion #sikh #sikhfashion #singhstreetstyle #sikhdesigner #bearded #indian #indianfashionblog #indiandesigner #international #ootd #lookbook #delhistyleblog #delhifashionblog"
},
"type" : "image",
"tags" : [
"men",
"delhifashionblog",
"menswearofficial",
"fashiondesigner",
"singhstreetstyle",
"fashionblog",
"mensfashion",
"fashion",
"sikhfashion",
"delhistyleblog",
"sikhdesigner",
"indianfashionblog",
"lookbook",
"fashionfashion",
"designer",
"streetfashion",
"international",
"paramsahib",
"mensfashionblogger",
"indian",
"blog",
"mensfashionblog",
"menstyle",
"ootd",
"indiandesigner",
"menswear",
"blogger",
"sikh",
"streetstyle",
"bearded"
],
"filter" : "Normal",
"attribution" : null,
"location" : null,
"id" : "1412879673545024927_1373991666",
"likes" : {
"count" : 163
}
}
}
},
With dynamic mapping (the default), string fields are indexed as text with a keyword sub-field, and terms aggregations need the keyword variant. The aggregation's size controls how many of the most frequent values are returned, so the top 10 tags can be fetched like this (use post.tags instead of post.tags.keyword if your mapping defines the field as keyword directly):
{
  "size": 0,
  "aggs": {
    "frequent_tags": {
      "terms": {
        "field": "post.tags.keyword",
        "size": 10
      }
    }
  }
}
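One note for larger indices (the question mentions over 2 million documents): terms aggregations compute their top terms per shard and then merge the shard results, so the counts are approximate. Raising shard_size makes the top-10 list more accurate at some extra cost per shard (a sketch, assuming the keyword sub-field as above):
{
  "size": 0,
  "aggs": {
    "frequent_tags": {
      "terms": {
        "field": "post.tags.keyword",
        "size": 10,
        "shard_size": 100
      }
    }
  }
}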
