Combine output of a first filter as input of a second filter - Elasticsearch

We have an Elasticsearch instance whose entries have two tagged fields:
sessionid
message
In a first filter, I find all entries where the message contains a certain substring. Each of those entries contains a sessionid.
In a second filter, I want to find all messages where the sessionid matches one of the sessionids returned by the first filter. This filter should go through all entries a second time.
For example, given the log below (sessionid;message):
1234;miss 1
2456;miss 2
1234;match
When filtering for the string "match" in the message part, the combined query should output:
1234;miss 1
1234;match
We are using KQL.
Background: we want an easy way to follow complete flows that contain an error string in one of their messages, in a multithreaded environment.

I understand why you'd want to do that in one go, but it's not possible in Elasticsearch. You cannot "revisit" documents which you've already ruled out by a different query -- searching for "match" would disqualify all the "miss" entries.
It's unfortunate that you have the log message combined with the ID, but you can try this:
Find all entries that match "match" (pun intended) -- I'm assuming you do have a keyword field available:
GET your_index/_search
{
"query": {
"regexp": {
"separated_msg.keyword": ".*\\;match.*"
}
}
}
Post-process the hits and extract the session IDs
Run session ID matching:
GET your_index/_search
{
"query": {
"regexp": {
"separated_msg.keyword": "1234;.*"
}
}
}
or on multiple IDs using a bool query with should clauses:
GET your_index/_search
{
"query": {
"bool": {
"should": [
{
"regexp": {
"separated_msg.keyword": "1234;.*"
}
},
{
"regexp": {
"separated_msg.keyword": "4567;.*"
}
}
]
}
}
}
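If sessionid is also indexed as its own field (the aggregation-based answer below assumes it is), the second pass does not need regexes at all: extract the session IDs from the first response on the client side and feed them into a terms query. A minimal sketch, assuming a sessionid field mapped as keyword or numeric:
GET your_index/_search
{
  "query": {
    "terms": {
      "sessionid": ["1234", "4567"]
    }
  }
}
This returns every entry belonging to the collected sessions, i.e. both the "match" and the "miss" lines of session 1234 in the example.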

If a unique numeric value can be assigned to each message, e.g. 2 for "match" and 1 for "miss 1", then a bucket_selector and top_hits can be used.
{
"size": 0,
"aggs": {
"sessionid": {
"terms": {
"field": "sessionid", --> first get all unique sessionids
"size": 10
},
"aggs": {
"documents":{
"top_hits": {
"size": 10
}
},
"messageid": {
"terms": {
"field": "messageid", ---> get unique sessionId
"size": 10
},
"aggs": {
"matching_messageid": { ---> select a bucket with key(message Id) as 2
"bucket_selector": {
"buckets_path": {
"key": "_key"
},
"script": "params.key==2"
}
}
}
},
"my_bucket": {
"bucket_selector": {
"buckets_path": {
"hits": "messageid._bucket_count"
},
"script": "params.hits>0"--> if bucket not empty then consider that sessionid
}
}
}
}
}
}
Result
"aggregations" : {
"sessionid" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 1234,
"doc_count" : 2,
"documents" : {
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "index31",
"_type" : "_doc",
"_id" : "MTAYpnABheSAx2q_eNEF",
"_score" : 1.0,
"_source" : {
"sessionid" : 1234,
"message" : "miss 1",
"messageid" : 1
}
},
{
"_index" : "index31",
"_type" : "_doc",
"_id" : "MjAYpnABheSAx2q_n9FW",
"_score" : 1.0,
"_source" : {
"sessionid" : 1234,
"message" : "match",
"messageid" : 2
}
}
]
}
},
"messageid" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 2,
"doc_count" : 1
}
]
}
}
]
}
}
If a given message has a timestamp, a max/min metric aggregation referenced through buckets_path can similarly be used to select the buckets that contain that message.
The best approach to the above problem would be to use nested documents:
{
"sessionid":1234,
"messages":[
{
"message":"match"
},
{
"message":"miss 1"
}
]
}
Then the problem can be resolved with a nested query. If Logstash is used, the above structure can be generated at indexing time.
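A minimal sketch of such a nested query, assuming the messages array is mapped with "type": "nested" and the index is named your_index (a placeholder):
GET your_index/_search
{
  "query": {
    "nested": {
      "path": "messages",
      "query": {
        "match": {
          "messages.message": "match"
        }
      }
    }
  }
}
Because each session is a single parent document, one hit returns the sessionid together with all of its messages.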

Related

multisearch API via curl

I've been looking at the documentation for the multisearch API with the objective of exporting specific values of a field in Elasticsearch for a given time period.
I still haven't figured out a way of getting all the values of fieldA for the past 24h while applying a filter of filter: KEY.
Is this possible to do via a curl request to the Elasticsearch endpoint? Running 7.7.0.
You can use a term query to filter on a value and a range query to get documents newer than a given date.
A terms aggregation will give all values for a field. If you just need the documents, you can skip this part.
Query:
{
"query": {
"bool": {
"filter": [
{ --> to filter on a value
"term": {
"fieldA.keyword": "A"
}
},
{
"range": {
"timestamp": {
"gte": "now-24h/h" --> within 24 hr from now
}
}
}
]
}
},
"aggs": {
"fieldA": {
"terms": { --> term aggregation
"field": "fieldA.keyword",
"size": 10
}
}
}
}
Result:
"hits" : [
{
"_index" : "index57",
"_type" : "_doc",
"_id" : "Uf4aOnIBRc7WtBUiRs6e",
"_score" : 0.0,
"_source" : {
"timestamp" : "2020-05-21",
"fieldA" : "A"
}
}
]
},
"aggregations" : {
"fieldA" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "A",
"doc_count" : 1
}
]
}
}
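Since the question mentions the multisearch API: the same request can be wrapped in the _msearch bulk format, one header line followed by one body line per search. A sketch against the example index above, bundling the filtered search and the aggregation-only variant into a single round trip:
GET index57/_msearch
{}
{"query":{"bool":{"filter":[{"term":{"fieldA.keyword":"A"}},{"range":{"timestamp":{"gte":"now-24h/h"}}}]}}}
{}
{"size":0,"aggs":{"fieldA":{"terms":{"field":"fieldA.keyword","size":10}}}}
The same bodies can be sent with curl by POSTing them as newline-delimited JSON (Content-Type: application/x-ndjson) to the _msearch endpoint.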

ElasticSearch query fields in disabled object

I have an Elasticsearch 6.8.7 cluster.
I have a field with this mapping:
"event_object": { "enabled": false, "type": "object" }
I want to search for records that match certain other criteria and also have a particular value for a particular field in this object.
So far, I have tried variations of combining a normal search on the indexed fields with a filter script for the unindexed ones:
GET /my_index/_search
{
"query":{
"bool":{
"must":{
"query_string": {
"query": "foo:bar"
}
},
"filter": {
"script": {
"script": {
"source": "doc[\"event_object\"][\"state\"].value == \"R\""
}
}
}
}
},
"terminate_after":1000,
"from":0,
"size":1000
}
This is a hodgepodge pieced together from Google searches, but I can't even get it to compile, let alone run and filter.
It is not possible to access the content of JSON objects that have enabled: false. From the official documentation:
Elasticsearch skips parsing of the contents of the field entirely. The JSON can still be retrieved from the _source field, but it is not searchable or stored in any other way
So even scripting will not help here.
However, there is one way to access this disabled data by scripting inside a terms aggregation (using the include parameter and a top_hits sub-aggregation):
POST test/_search
{
"query": {
"match_all": {}
},
"aggs": {
"state": {
"terms": {
"script": "params._source.event_object.state",
"size": 100,
"include": "R"
},
"aggs": {
"hits": {
"top_hits": {
"size": 10
}
}
}
}
}
}
And you'd get a response like this one:
"aggregations" : {
"state" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "R",
"doc_count" : 1,
"hits" : {
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "test",
"_type" : "_doc",
"_id" : "1",
"_score" : 1.0,
"_source" : {
"event_object" : {
"state" : "R"
},
"test" : "hello"
}
}
]
}
}
}
]
}
}

Is there a way to get related documents if a match occurs after the query?

I am currently doing a fuzzy name search on some documents. These documents can be related to each other (for example, the name field of one document may contain the name and another may contain an alias for the same person). I give such documents the same unique identifier. My question is: can I get all documents with the same unique identifier if a match occurs in any of them?
Suppose that there are 4 documents like this.
[
{
"name": "Bob",
"uid": "1"
},
{
"name": "Bilbo",
"uid": "1"
},
{
"name": "Jack",
"uid": "2"
},
{
"name": "Mary",
"uid": "3"
}
]
When I query name "Bob", I expect to get both documents with "uid" = "1"
[
{
"name": "Bob",
"uid": "1"
},
{
"name": "Bilbo",
"uid": "1"
}
]
Elasticsearch doesn't have the concept of joins, so documents cannot be fetched by joining on "uid".
1. Using two queries
i. Get documents with name "Bob"
{
"query": {
"term": {
"name.keyword": {
"value": "Bob"
}
}
}
}
ii. Fetch documents using the ids returned above, as sketched below.
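A minimal sketch of step ii, assuming the uids collected from step i (here just "1") are passed into a terms query against the same index (index61 in the result further down):
GET index61/_search
{
  "query": {
    "terms": {
      "uid.keyword": ["1"]
    }
  }
}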
2. Using terms and bucket selector aggregation
Mapping:
{
"<mapping_name>" : {
"mappings" : {
"properties" : {
"name" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"uid" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
}
}
}
Query:
1. Create a bucket (collection) per uid.
2. Create a sub-bucket of name which includes only "Bob", so uid 1 will have a bucket with key "Bob" while uid 2 and uid 3 will be empty.
3. Use a bucket_selector aggregation to keep only uids whose name sub-bucket count is greater than or equal to 1. This removes uid 2 and uid 3.
4. Use a top_hits aggregation to get the documents.
{
"size": 0,
"aggs": {
"uid": {
"terms": {
"field": "uid.keyword",
"size": 10
},
"aggs": {
"documents":{
"top_hits": { --> to get documents under parent term
"size": 10
}
},
"name": {
"terms": {
"field": "name.keyword", --> terms need non_analyzed field so keyword
"include":"Bob", --> get terms with name bob
"size": 10
}
},
"my_bucket":{
"bucket_selector": { --> select buckets which have atleast one name
"buckets_path": {"count":"name._bucket_count"},
"script": "if(params.count>=1) return true;"
}
}
}
}
}
}
Result: all documents with uid 1 (the same uid as "Bob") are returned
"aggregations" : {
"uid" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "1",
"doc_count" : 2,
"documents" : {
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "index61",
"_type" : "_doc",
"_id" : "uCP1-nAB_Wo5RvhlZM6k",
"_score" : 1.0,
"_source" : {
"name" : "Bob",
"uid" : "1"
}
},
{
"_index" : "index61",
"_type" : "_doc",
"_id" : "uSP1-nAB_Wo5Rvhlbc4S",
"_score" : 1.0,
"_source" : {
"name" : "Bilbo",
"uid" : "1"
}
}
]
}
},
"name" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Bob",
"doc_count" : 1
}
]
}
}
]
}
}

Aggregation on .keyword to return only the keys that contain a specific string

New to aggregations in Elasticsearch, using 7.2. I am trying to write an aggregation on Tree.keyword that only returns the count of documents that have a key containing the word "Branch". I have tried sub-aggregations, bucket_selector (which doesn't work for key strings) and scripts. Does anyone have any ideas or suggestions on how to approach this?
Mapping:
{
"testindex" : {
"mappings" : {
"properties" : {
"Tree" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword"
}
}
}
}
}
}
}
Example query that returns all the keys; what I need is to return only the keys containing "Branch", or better yet just the count of how many "Branch" keys there are:
GET testindex/_search
{
"aggs": {
"bucket": {
"terms": {
"field": "Tree.keyword"
}
}
}
}
Returns:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "testindex",
"_type" : "_doc",
"_id" : "1",
"_score" : 1.0,
"_source" : {
"Tree" : [
"Car:76",
"Branch:yellow",
"Car:one",
"Branch:blue"
]
}
}
]
},
"aggregations" : {
"bucket" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Car:76",
"doc_count" : 1
},
{
"key" : "Branch:yellow",
"doc_count" : 1
},
{
"key" : "Car:one",
"doc_count" : 1
},
{
"key" : "Branch:blue",
"doc_count" : 1
}
]
}
}
}
You have to add includes to limit the result. Here's a code sample that will hopefully help you:
GET testindex/_search
{
"_source": {
"includes": [
"Branch"
]
},
"aggs": {
"bucket": {
"terms": {
"field": "Tree.keyword"
}
}
}
}
It is possible to filter the values for which buckets will be created. This can be done using the include and exclude parameters, which accept regular expression strings or arrays of exact values. Additionally, include clauses can filter using partition expressions.
For your case, it should be like this (note that include takes a regular expression, not a wildcard pattern):
GET testindex/_search
{
"aggs": {
"bucket": {
"terms": {
"field": "Tree.keyword",
"include": "Branch:*"
}
}
}
}
Thanks for all the help! Unfortunately, none of those solutions worked for me. I ended up using a script that returns the branch values and collapses everything else into a single key, then used a bucket script to subtract 1 from the resulting cardinality. There is probably a better solution out there, but hopefully this helps someone:
GET testindex/_search
{
"aggs": {
"bucket": {
"cardinality": {
"field": "Tree.keyword",
"script": {
"lang": "painless",
"source": "if(_value.contains('Branches:')) { return _value} return 1;"
}
}
},
"Total_Branches": {
"bucket_script": {
"buckets_path": {
"my_var1": "bucket.value"
},
"script": "return params.my_var1-1"
}
}
}
}

ElasticSearch get last n distinct records

I am trying to implement a search query over records stored in Elasticsearch.
The record structure looks something like this.
{
"_index" : "box_info_store",
"_type" : "boxes",
"_id" : "pWjQLWkBIJk0ORjd0X2P",
"_score" : null,
"_source" : {
"transactionID" : "60ab66cf24c9924f562bf1a2b5d92305d0a6",
"boxNumber" : "Box3",
"createDate" : "2013-09-17T00:00:00",
"itemNumber" : "Item1",
"address" : "Sample Address"
}
}
One box can contain multiple items. For example, Box3 can have Item1, Item2 and Item3, so in Elasticsearch I will have 3 different documents. At the same time, the same box and the same item can also exist with a different address. The transactionID may or may not be the same for these documents.
My requirement is to fetch the last n (most recent) distinct transactionIDs, along with their records.
I tried the following query to fetch the last 7 distinct transactionIDs:
GET /box_info_store/boxes/_search?size=7
{
"query": {
"bool": {
"must": [
{"match":{"boxNumber":"Box3"}},
{"match":{"itemNumber":"Item1"}}
]
}
},
"sort": [
{
"createDate": {
"order": "desc"
}
}
],
"aggs": {
"distinct_transactions": {
"terms": { "field": "transactionID"}
}
}
}
This fetched me the last 7 documents where boxNumber is Box3 and itemNumber is Item1, but not 7 distinct transactionIDs; two of these seven documents have the same transactionID (though with different addresses).
My requirement is to get 7 distinct transactionIDs, no matter how many documents that returns.
I hope I was able to explain myself.
Appreciate any kind of help here
Thanks
------ Edited: #gaurav9620, I ran the first query and got a count of 32; then I ran the second query with a distinct count of 3 and got the following result:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 32,
"max_score" : null,
"hits" : [
{
"_index" : "box_info_store",
"_type" : "boxes",
"_id" : "RWjRLWkBIJk0ORjdEX-L",
"_score" : null,
"_source" : {
"transactionID" : "3087e106244f6247a5290fb21ce64254529c",
"boxNumber" : "Box3",
"createDate" : "2017-11-15T00:00:00",
"itemNumber" : "Item1",
"address" : "sampleAddress12",
},
"sort" : [
1510704000000
]
},
{
"_index" : "box_info_store",
"_type" : "boxes",
"_id" : "MGjQLWkBIJk0ORjdwX0M",
"_score" : null,
"_source" : {
"transactionID" : "60ab66cf24c9924f562bf1a2b5d92305d0a6",
"boxNumber" : "Box3",
"createDate" : "2016-04-03T00:00:00",
"itemNumber" : "Item1",
"address" : "sampleAddress321",
},
"sort" : [
1459641600000
]
},
..........
..........
..........
{
"_index" : "box_info_store",
"_type" : "boxes",
"_id" : "AGjRLWkBIJk0ORjdK4CJ",
"_score" : null,
"_source" : {
"transactionID" : "3087e106244f6247a5290fb21ce64254529c",
"boxNumber" : "Box3",
"createDate" : "1996-02-16T00:00:00",
"itemNumber" : "Item1",
"address" : "sampleAddress4324",
},
"sort" : [
824428800000
]
}
]
},
"aggregations" : {
"unique_transactions" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 16,
"buckets" : [
{
"key" : "3087e106244f6247a5290fb21ce64254529c",
"doc_count" : 6
},
{
"key" : "27c5f3422f4482495d29e7b2c15c0e311743",
"doc_count" : 5
},
{
"key" : "c40e53212e74e24bf02a5bd2b134cf92bffb",
"doc_count" : 5
}
]
}
}
}
The size which you have used represents the number of raw documents that are retrieved.
In your case, what you need to do is:
Set size to 0, which will return no raw documents.
Include a size parameter in the aggregation, which will return the 7 unique ids.
GET /box_info_store/boxes/_search?size=0
{
"query": {
"bool": {
"must": [
{
"match": {
"boxNumber": "Box3"
}
},
{
"match": {
"itemNumber": "Item1"
}
}
]
}
},
"sort": [
{
"createDate": {
"order": "desc"
}
}
],
"aggs": {
"distinct_transactions": {
"terms": {
"field": "transactionID",
"size": 7
}
}
}
}
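If the records for each of those 7 transactionIDs are needed in the same request (the question asks for the IDs "along with their records"), one option is to order the terms buckets by the most recent createDate and pull the documents with a top_hits sub-aggregation. A sketch, assuming transactionID is mapped so it can be aggregated on (e.g. as keyword):
GET /box_info_store/boxes/_search?size=0
{
  "query": {
    "bool": {
      "must": [
        { "match": { "boxNumber": "Box3" } },
        { "match": { "itemNumber": "Item1" } }
      ]
    }
  },
  "aggs": {
    "distinct_transactions": {
      "terms": {
        "field": "transactionID",
        "size": 7,
        "order": { "latest": "desc" }
      },
      "aggs": {
        "latest": { "max": { "field": "createDate" } },
        "records": {
          "top_hits": {
            "size": 10,
            "sort": [ { "createDate": { "order": "desc" } } ]
          }
        }
      }
    }
  }
}
Each of the 7 buckets then carries its own documents under records.hits.hits.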
EDIT-------------------------------------
First, fire this query:
GET /box_info_store/boxes/_search?size=0
{
"query": {
"bool": {
"must": [
{
"match": {
"boxNumber": "Box3"
}
},
{
"match": {
"itemNumber": "Item1"
}
}
]
}
}
}
Here you will find the total number of documents matching your query, which you can use as n.
After this, fire your query as below:
GET /box_info_store/boxes/_search?size=n
{
"query": {
"bool": {
"must": [
{
"match": {
"boxNumber": "Box3"
}
},
{
"match": {
"itemNumber": "Item1"
}
}
]
}
},
"sort": [
{
"createDate": {
"order": "desc"
}
}
],
"aggs": {
"distinct_transactions": {
"terms": {
"field": "transactionID",
"size": NUMBER_OF_UNIQUE_TRANSACTION_IDS_TO_BE_FETCHED
}
}
}
}
