How to query all content from a field in Elasticsearch - elasticsearch

I'm querying data from Elasticsearch with Python. I can query a certain value in a field like this:
GET index/_search
{
"query": {
"match" : {
"somefieldname": "somevalue"
}
}
}
But how can I query all values inside the field somefieldname?
UPDATE:
Here's an example index:
"_index" : "indexname",
"_type" : "_doc",
"_id" : "lJlcO3wBhlKWxmXE9jrd",
"_score" : 0,
"_source": {
"field1": "abc",
"field2": "123",
"field3": "def"
},
"_index" : "indexname",
"_type" : "_doc",
"_id" : "lJlcO3wBhlKWxmXE9jrd",
"_score" : 0,
"_source": {
"field1": "fgh",
"field2": "654",
"field3": "kui"
},
"_index" : "indexname",
"_type" : "_doc",
"_id" : "lJlcO3wBhlKWxmXE9jrd",
"_score" : 0,
"_source": {
"field1": "567",
"field2": "gfr",
"field3": "234"
},
Now I want to query all content from field2 across all docs, so that my output is ["123", "654", "gfr"].
UPDATE:
Index mapping for the field:
{
"myindex" : {
"mappings" : {
"field2" : {
"full_name" : "field2",
"mapping" : {
"field2" : {
"type" : "keyword"
}
}
}
}
}
}

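You can use a terms aggregation to get the unique values of field2. Note that a terms aggregation returns only the top 10 buckets by default, so raise its size parameter if the field has more distinct values.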
{
"size": 0,
"aggs": {
"field2values": {
"terms": {
"field": "field2"
}
}
}
}
Search Result would be
"aggregations": {
"field2values": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "123",
"doc_count": 1
},
{
"key": "654",
"doc_count": 1
},
{
"key": "gfr",
"doc_count": 1
}
]
}
}
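Since the question mentions querying from Python, here is a minimal sketch of turning that aggregation response into the plain list the question asks for. The helper name `bucket_keys` is hypothetical, and the `response` dict simply copies the sample result above; in practice it would be the parsed JSON body returned by whichever client you use.

```python
def bucket_keys(response, agg_name):
    """Return the bucket keys of a terms aggregation, in response order."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return [bucket["key"] for bucket in buckets]


# Sample response copied from the search result above.
response = {
    "aggregations": {
        "field2values": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
                {"key": "123", "doc_count": 1},
                {"key": "654", "doc_count": 1},
                {"key": "gfr", "doc_count": 1},
            ],
        }
    }
}

values = bucket_keys(response, "field2values")
print(values)  # ['123', '654', 'gfr']
```

The keys are unique values, so a value that occurs in several documents appears once, with its frequency in `doc_count`.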

Related

Search by multiple values using NEST(ElasticSearch)

I have an index called "campaigns" with these records:
"hits" : [
{
"_index" : "campaigns",
"_id" : "cf08b05c-c8b5-45cb-bca8-17267c3613fb",
"_source" : {
"PublisherId" : 1,
"CurrentStatus" : "Pending"
}
},
{
"_index" : "campaigns",
"_id" : "39436cb3-483e-4fb4-92e4-4e06ecad27a1",
"_source" : {
"PublisherId" : 1,
"CurrentStatus" : "Approved"
}
},
{
"_index" : "campaigns",
"_id" : "21436cb1-583e-4fb4-92e4-4e06ecad23a2",
"_source" : {
"PublisherId" : 1,
"CurrentStatus" : "Rejected"
}
}
]
I want to get all campaigns with PublisherId = 1 and a status of either Approved or Rejected. Something like this:
var statuses = new[] {CampaignStatus.Approved,CampaignStatus.Rejected};
campaigns.Where(c=> c.PublisherId == 1 && statuses.Contains(c.CurrentStatus)).ToList();
How can I run this query using NEST?
Expected Result:
"hits" : [
{
"_index" : "campaigns",
"_id" : "39436cb3-483e-4fb4-92e4-4e06ecad27a1",
"_source" : {
"PublisherId" : 1,
"CurrentStatus" : "Approved"
}
},
{
"_index" : "campaigns",
"_id" : "21436cb1-583e-4fb4-92e4-4e06ecad23a2",
"_source" : {
"PublisherId" : 1,
"CurrentStatus" : "Rejected"
}
}
]
I don't know the NEST syntax, but since Elasticsearch is REST-based, here is a working example query in JSON format that you can convert to NEST code.
Index mapping
{
"mappings": {
"properties": {
"PublisherId": {
"type": "integer"
},
"CurrentStatus": {
"type": "text"
}
}
}
}
Index all three sample docs and use the search query below:
{
"query": {
"bool": {
"must": {
"term": {
"PublisherId": 1
}
},
"should": [
{
"match": {
"CurrentStatus": "Rejected"
}
},
{
"match": {
"CurrentStatus": "Approved"
}
}
],
"minimum_should_match" : 1
}
}
}
Search Result
"hits": [
{
"_index": "stof_63968525",
"_type": "_doc",
"_id": "1",
"_score": 1.9808291,
"_source": {
"PublisherId": 1,
"CurrentStatus": "Approved"
}
},
{
"_index": "stof_63968525",
"_type": "_doc",
"_id": "3",
"_score": 1.9808291,
"_source": {
"PublisherId": 1,
"CurrentStatus": "Rejected"
}
}
]
Note the use of minimum_should_match, which forces at least one of the statuses Rejected and Approved to match. Refer to the bool query documentation in Elasticsearch to understand the query construct.
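To generalize the JSON query above to an arbitrary list of statuses, the request body can be assembled programmatically. The following is a sketch in Python rather than NEST; the function name `status_query` is hypothetical, and it only builds the request body without sending it anywhere.

```python
import json


def status_query(publisher_id, statuses):
    """Build a bool query: PublisherId must match and at least one status must match."""
    return {
        "query": {
            "bool": {
                "must": {"term": {"PublisherId": publisher_id}},
                # One match clause per allowed status.
                "should": [{"match": {"CurrentStatus": s}} for s in statuses],
                # Without this, the should clauses would only influence scoring.
                "minimum_should_match": 1,
            }
        }
    }


body = status_query(1, ["Rejected", "Approved"])
print(json.dumps(body, indent=2))
```

For the two statuses in the question, this yields the same body as the hand-written query.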
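Did you try this?
// PublisherId must equal 1.
QueryContainer queryAnd = new TermQuery() { Field = "PublisherId", Value = 1 };
// CurrentStatus must be Approved or Rejected; |= combines the two queries with OR.
// Note: term queries are not analyzed, so if CurrentStatus is mapped as text this may need to target a keyword sub-field instead.
QueryContainer queryOr = new TermQuery() { Field = "CurrentStatus", Value = "Approved" };
queryOr |= new TermQuery() { Field = "CurrentStatus", Value = "Rejected" };
// & combines both conditions with AND.
QueryContainer queryMain = queryAnd & queryOr;
ISearchResponse<campaigns> searchResponse = elasticClient.Search<campaigns>(s => s
.Query(q2 => q2
.Bool(b => b
.Should(queryMain)
)));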

Return field even if specific field value isn't available

I have this bool query:
{
"bool": {
"must_not": [
{
"exists": {
"field": "*multiparttype.doNotDisplay",
"boost": 1
}
}
],
"should": [
{
"exists": {
"field": "multiparttype",
"boost": 1
}
},
{
"exists": {
"field": "*multiparttype.oldValue",
"boost": 1
}
},
{
"exists": {
"field": "*multiparttype.newValue",
"boost": 1
}
}
]
}
}
It returns data if the document has the following structure. For a document like the one below, the query works and returns the document:
multiparttype{
oldValue: "YY",
newValue:"XXX",
type:10
}
But if the document just has this:
multiparttype{
type:10
}
OR
multiparttype{
}
the above query won't return the document.
How can I make this possible?
Based on your problem, you need to add a match_all query, which matches every document and gives each one a score of 1.0.
The following data was in the index:
multiparttype = { "oldValue" : "versionX","newValue" : "versionY"}
multiparttype = { "oldValue" : "versionX","newValue" : "versionY"}
empty_field : "test",multiparttype : {}
multiparttype = {"type" : "typetest"}
The following query was corrected accordingly; the boost values can be changed based on your requirements.
"query": {
"bool": {
"should": [
{
"match_all": {}
},
{
"exists": {
"field": "multiparttype.oldValue",
"boost": 1
}
},
{
"exists": {
"field": "multiparttype.newValue",
"boost": 1
}
}
],
"must_not": [
{
"exists": {
"field": "*multiparttype.doNotDisplay"
}
}
]
}
}
The following response will be generated:
"hits" : {
"total" : {
"value" : 4,
"relation" : "eq"
},
"max_score" : 3.0,
"hits" : [
{
"_index" : "stackoverflow-field",
"_type" : "_doc",
"_id" : "7Qg7TnQB3IIDvL59KA7i",
"_score" : 3.0,
"_source" : {
"multiparttype" : {
"oldValue" : "versionX",
"newValue" : "versionY"
}
}
},
{
"_index" : "stackoverflow-field",
"_type" : "_doc",
"_id" : "1wmWTnQB3IIDvL59lAAL",
"_score" : 1.0,
"_source" : {
"multiparttype" : {
"type" : "typetest"
}
}
},
{
"_index" : "stackoverflow-field",
"_type" : "_doc",
"_id" : "tQmbTnQB3IIDvL59Zgy7",
"_score" : 1.0,
"_source" : {
"empty_field" : "test"
}
},
{
"_index" : "stackoverflow-field",
"_type" : "_doc",
"_id" : "tQmcTnQB3IIDvL59fA8Z",
"_score" : 1.0,
"_source" : {
"empty_field" : "test",
"multiparttype" : { }
}
}
]
}
Documentation : https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-all-query.html
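The scores in that response (3.0 for the first hit, 1.0 for the rest) follow from the bool query summing the scores of the matching should clauses. The sketch below is a toy model in Python, not how Elasticsearch evaluates queries: `has_value` and `toy_score` are hypothetical helpers, the leading wildcard in the must_not field pattern is ignored, and each match_all or exists clause is assumed to contribute a constant 1.0.

```python
def has_value(doc, path):
    """Toy stand-in for an exists query: True if the dotted path holds a non-empty value."""
    current = doc
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return False
        current = current[part]
    return current not in (None, {}, [])


def toy_score(doc):
    """Return None if the must_not clause excludes the doc, else the summed should scores."""
    if has_value(doc, "multiparttype.doNotDisplay"):
        return None
    score = 1.0  # match_all always matches
    for path in ("multiparttype.oldValue", "multiparttype.newValue"):
        if has_value(doc, path):
            score += 1.0  # each matching exists clause (boost 1)
    return score


# The four _source documents from the response above.
docs = [
    {"multiparttype": {"oldValue": "versionX", "newValue": "versionY"}},
    {"multiparttype": {"type": "typetest"}},
    {"empty_field": "test"},
    {"empty_field": "test", "multiparttype": {}},
]

scores = [toy_score(d) for d in docs]
print(scores)  # [3.0, 1.0, 1.0, 1.0]
```

Because match_all sits in the should list, every document that survives must_not is returned, and the exists clauses only push documents with oldValue or newValue higher in the ranking.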

Skip duplicates on field in a Elasticsearch search result

Is it possible to remove duplicates on a given field?
For example the following query:
{
"query": {
"term": {
"name_admin": {
"value": "nike"
}
}
},
"_source": [
"name_admin",
"parent_sku",
"sku"
],
"size": 2
}
is retrieving
"hits" : [
{
"_index" : "product",
"_type" : "_doc",
"_id" : "central30603",
"_score" : 4.596813,
"_source" : {
"parent_sku" : "SSP57",
"sku" : "SSP57816401",
"name_admin" : "NIKE U NSW PRO CAP NIKE AIR"
}
},
{
"_index" : "product",
"_type" : "_doc",
"_id" : "central156578",
"_score" : 4.596813,
"_source" : {
"parent_sku" : "SSP57",
"sku" : "SSP57816395",
"name_admin" : "NIKE U NSW PRO CAP NIKE AIR"
}
}
]
I'd like to skip duplicates on parent_sku so that I only get one result per parent_sku, as is possible with suggesters by setting something like "skip_duplicates": true.
I know I could achieve this with an aggregation, but I'd like to stick with a search, as my query is a bit more complicated and I'm using the scroll API, which doesn't work with aggregations.
Field collapsing should help here:
{
"query": {
"term": {
"name_admin": {
"value": "nike"
}
}
},
"collapse" : {
"field" : "parent_sku",
"inner_hits": {
"name": "parent",
"size": 1
}
},
"_source": false,
"size": 2
}
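The above query will return one document per parent_sku. Note that, per the Elasticsearch documentation, collapse cannot be used together with the scroll API.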
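To illustrate the effect of collapsing, here is a small client-side emulation in Python. It is only a sketch of the behaviour (keep the best-ranked hit per distinct field value); Elasticsearch does this server-side, and the function name `collapse` is hypothetical. The hits are abbreviated from the question.

```python
def collapse(hits, field):
    """Keep only the first (highest-ranked) hit for each distinct value of `field`."""
    seen = set()
    result = []
    for hit in hits:
        key = hit["_source"].get(field)
        if key in seen:
            continue  # a better-ranked hit with the same value was already kept
        seen.add(key)
        result.append(hit)
    return result


hits = [
    {"_id": "central30603", "_source": {"parent_sku": "SSP57", "sku": "SSP57816401"}},
    {"_id": "central156578", "_source": {"parent_sku": "SSP57", "sku": "SSP57816395"}},
]

collapsed = collapse(hits, "parent_sku")
print([h["_id"] for h in collapsed])  # ['central30603']
```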

elastic query to get events where corresponding pair is missing

I have transaction records that follow this lifecycle:
Event when transaction is received [RCVD]
Event when transaction gets pending for execution [PNDG] (OPTIONAL step)
Event when it gets executed [SENT]
Following are the 7 sample events in the index:
{trxID: 1, status:RCVD}
{trxID: 2, status:RCVD}
{trxID: 3, status:RCVD}
{trxID: 2, status:PNDG}
{trxID: 3, status:PNDG}
{trxID: 1, status:SENT}
{trxID: 2, status:SENT}
I need to find all the transactions that went to the pending state but have not been executed yet. In other words, there should be a PNDG status for the transaction but no SENT.
I am trying not to do it in the Java layer.
I did an aggregation on trxID, and then I did sub aggregation on status.
Then I cannot figure out how to get those records where the bucket has only PNDG in the sub-aggregation. I am not sure if I am thinking in the right direction.
The result I am expecting is trxID 3, because for this transaction we got a PNDG status but no SENT yet. On the other hand, trxID 1 should not be reported, as it never went to the PNDG (pending) state, irrespective of whether a SENT status is reported or not.
You can use the count of distinct statuses under each transaction ID:
GET index24/_search
{
"size": 0,
"aggs": {
"transactionId": {
"terms": {
"field": "trxID",
"size": 10
},
"aggs": {
"status": {
"terms": {
"field": "status.keyword",
"size": 10
}
},
"count": {
"cardinality": {
"field": "status.keyword"
}
},
"my_bucketselector": {
"bucket_selector": {
"buckets_path": {
"statusCount": "count"
},
"script": "params.statusCount==1"
}
}
}
}
}
}
Response:
"aggregations" : {
"transactionId" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 4,
"doc_count" : 1,
"count" : {
"value" : 1
},
"status" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "PNDG",
"doc_count" : 1
}
]
}
}
]
}
}
EDIT 1:
I have tried the following:
Get the max date for a transaction ID, then get the max date among its pending events. If both dates are the same, pending is the latest status.
Data:
[
{
"_index" : "index24",
"_type" : "_doc",
"_id" : "aYCs0m0BD5PlkoxXxO36",
"_score" : 1.0,
"_source" : {
"trxID" : 1,
"status" : "RCVD",
"date" : "2019-10-15T12:00:00"
}
},
{
"_index" : "index24",
"_type" : "_doc",
"_id" : "aoCs0m0BD5PlkoxX7e35",
"_score" : 1.0,
"_source" : {
"trxID" : 1,
"status" : "PNDG",
"date" : "2019-10-15T12:01:00"
}
},
{
"_index" : "index24",
"_type" : "_doc",
"_id" : "a4Ct0m0BD5PlkoxXCO06",
"_score" : 1.0,
"_source" : {
"trxID" : 1,
"status" : "SENT",
"date" : "2019-10-15T12:02:00"
}
},
{
"_index" : "index24",
"_type" : "_doc",
"_id" : "bICt0m0BD5PlkoxXQe0Y",
"_score" : 1.0,
"_source" : {
"trxID" : 2,
"status" : "RCVD",
"date" : "2019-10-15T12:00:00"
}
},
{
"_index" : "index24",
"_type" : "_doc",
"_id" : "bYCt0m0BD5PlkoxXZO2x",
"_score" : 1.0,
"_source" : {
"trxID" : 2,
"status" : "PNDG",
"date" : "2019-10-15T12:01:00"
}
},
{
"_index" : "index24",
"_type" : "_doc",
"_id" : "boCt0m0BD5PlkoxXju1H",
"_score" : 1.0,
"_source" : {
"trxID" : 3,
"status" : "RCVD",
"date" : "2019-10-15T12:00:00"
}
},
{
"_index" : "index24",
"_type" : "_doc",
"_id" : "b4Ct0m0BD5PlkoxXou0-",
"_score" : 1.0,
"_source" : {
"trxID" : 3,
"status" : "SENT",
"date" : "2019-10-15T12:01:00"
}
}
]
Query:
GET index24/_search
{
"size": 0,
"aggs": {
"transactionId": {
"terms": {
"field": "trxID",
"size": 10000
},
"aggs": {
"maxDate": {
"max": {
"field": "date" ---> get max date under transactions
}
},
"pending_status": {
"filter": {
"term": {
"status.keyword": "PNDG" ---> filter for pending
}
},
"aggs": {
"filtered_maxdate": {
"max": {
"field": "date" --> get date under pending
}
}
}
},
"buckets_latest_status_pending": { -->filter if max date==pending date
"bucket_selector": {
"buckets_path": {
"filtereddate": "pending_status>filtered_maxdate",
"maxDate": "maxDate"
},
"script": "params.filtereddate==params.maxDate"
}
}
}
}
}
}
Response:
{
"transactionId" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 2, --> only transaction id 2 is returned
"doc_count" : 2,
"pending_status" : {
"doc_count" : 1,
"filtered_maxdate" : {
"value" : 1.57114086E12,
"value_as_string" : "2019-10-15T12:01:00.000Z"
}
},
"maxDate" : {
"value" : 1.57114086E12,
"value_as_string" : "2019-10-15T12:01:00.000Z"
}
}
]
}
}
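The same rule (the latest PNDG date equals the overall latest date) can be cross-checked outside Elasticsearch. The Python sketch below emulates the aggregation on the sample data from this edit; the function name `pending_is_latest` is hypothetical, and it relies on the ISO timestamps all having the same format so that string comparison matches chronological order.

```python
events = [
    {"trxID": 1, "status": "RCVD", "date": "2019-10-15T12:00:00"},
    {"trxID": 1, "status": "PNDG", "date": "2019-10-15T12:01:00"},
    {"trxID": 1, "status": "SENT", "date": "2019-10-15T12:02:00"},
    {"trxID": 2, "status": "RCVD", "date": "2019-10-15T12:00:00"},
    {"trxID": 2, "status": "PNDG", "date": "2019-10-15T12:01:00"},
    {"trxID": 3, "status": "RCVD", "date": "2019-10-15T12:00:00"},
    {"trxID": 3, "status": "SENT", "date": "2019-10-15T12:01:00"},
]


def pending_is_latest(events):
    """Return the trxIDs whose latest PNDG date equals their overall latest date."""
    max_date = {}     # trxID -> latest date over all events (the maxDate agg)
    max_pending = {}  # trxID -> latest date over PNDG events (the filtered_maxdate agg)
    for e in events:
        t, d = e["trxID"], e["date"]
        max_date[t] = max(max_date.get(t, d), d)
        if e["status"] == "PNDG":
            max_pending[t] = max(max_pending.get(t, d), d)
    return sorted(t for t in max_date if max_pending.get(t) == max_date[t])


result = pending_is_latest(events)
print(result)  # [2]
```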
I did an aggregation on trxID, and then I did sub aggregation on status.
That's a great start !!!
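Now, you can leverage the bucket_selector pipeline aggregation to surface only the transactions that have fewer than three distinct statuses: the script condition params.eventCount < 3 keeps the buckets that are missing at least one of RCVD, PNDG and SENT. Note that this is equivalent to "PNDG but no SENT" only if every transaction that reaches SENT has passed through PNDG first; a transaction with just RCVD and SENT also has two distinct statuses and would pass the filter: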
POST events/_search
{
"size": 0,
"aggs": {
"trx": {
"terms": {
"field": "trxID",
"size": 1000
},
"aggs": {
"count": {
"cardinality": {
"field": "status.keyword"
}
},
"not_sent": {
"bucket_selector": {
"buckets_path": {
"eventCount": "count"
},
"script": "params.eventCount < 3"
}
}
}
}
}
}
The response then contains the bucket for trxID = 3:
"aggregations" : {
"trx" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 3,
"doc_count" : 2,
"count" : {
"value" : 2
}
}
]
}
}
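As a cross-check of the intended result, the rule "has PNDG but no SENT" can be emulated in Python on the seven sample events from the question. This is a sketch for comparison only, not part of the Elasticsearch answer. It also evaluates the "fewer than three distinct statuses" rule, which on this data additionally admits trxID 1 because PNDG is optional.

```python
from collections import defaultdict

# The seven sample events from the question.
events = [
    {"trxID": 1, "status": "RCVD"},
    {"trxID": 2, "status": "RCVD"},
    {"trxID": 3, "status": "RCVD"},
    {"trxID": 2, "status": "PNDG"},
    {"trxID": 3, "status": "PNDG"},
    {"trxID": 1, "status": "SENT"},
    {"trxID": 2, "status": "SENT"},
]

# Group the distinct statuses by transaction, like a terms agg with a sub-agg on status.
statuses = defaultdict(set)
for e in events:
    statuses[e["trxID"]].add(e["status"])

pending_not_sent = sorted(t for t, s in statuses.items() if "PNDG" in s and "SENT" not in s)
fewer_than_three = sorted(t for t, s in statuses.items() if len(s) < 3)

print(pending_not_sent)  # [3]
print(fewer_than_three)  # [1, 3]
```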

elasticsearch groupby and filter by regex condition

It's a bit hard for me to define the question as I'm not very experienced with Elasticsearch. I'm focusing the question on my specific problem:
Assuming I have the following records:
{
id: 1
name: bla1_1.aaa
},
{
id: 1
name: bla1_2.bbb
},
{
id: 2
name: bla2_1.aaa
},
{
id: 2
name: bla2_2.aaa
}
What I want is to GET all the ids that have all of their names ending with aaa.
I was thinking about group by id and then do a regex query like so: *\.aaa so that all the name must satisfy the regex query.
On this particular example I would get id: 2 back.
How do I do it?
Let me know if there's anything I need to add to clarify the question.
A regexp query can be used.
The pattern .* matches any character any number of times, including zero.
A terms aggregation will give you the unique ids and the number of matching docs under each.
Mapping :
PUT regex
{
"mappings": {
"properties": {
"id":{
"type":"integer"
},
"name":{
"type":"text",
"fields": {
"keyword":{
"type":"keyword"
}
}
}
}
}
}
Data:
"hits" : [
{
"_index" : "regex",
"_type" : "_doc",
"_id" : "olQXjW0BywGFQhV7k84P",
"_score" : 1.0,
"_source" : {
"id" : 1,
"name" : "bla1_1.aaa"
}
},
{
"_index" : "regex",
"_type" : "_doc",
"_id" : "o1QXjW0BywGFQhV7us6B",
"_score" : 1.0,
"_source" : {
"id" : 1,
"name" : "bla1_2.bbb"
}
},
{
"_index" : "regex",
"_type" : "_doc",
"_id" : "pFQXjW0BywGFQhV77c6J",
"_score" : 1.0,
"_source" : {
"id" : 2,
"name" : "bla2_1.aaa"
}
},
{
"_index" : "regex",
"_type" : "_doc",
"_id" : "pVQYjW0BywGFQhV7Dc6F",
"_score" : 1.0,
"_source" : {
"id" : 2,
"name" : "bla2_2.aaa"
}
}
]
Query:
GET regex/_search
{
"size":0,
"query": {
"regexp": {
"name.keyword": {
"value": ".*\\.aaa" ---> name ending with .aaa (the literal dot is escaped)
}
}
},
"aggs": {
"unique_ids": {
"terms": {
"field": "id",
"size": 10
}
}
}
}
Result:
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"unique_ids" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 2, ---> 2 doc under id 2
"doc_count" : 2
},
{
"key" : 1, ----> 1 doc under id 1
"doc_count" : 1
}
]
}
}
Edit:
Use a bucket selector to keep only the buckets where the total count of docs for an id equals the count of docs matching the regex:
GET regex/_search
{
"size": 0,
"aggs": {
"unique_ids": {
"terms": {
"field": "id",
"size": 10
},
"aggs": {
"totalCount": { ---> to get total count of id(all docs)
"value_count": {
"field": "id"
}
},
"filter_agg": {
"filter": {
"bool": {
"must": [
{
"regexp": {
"name.keyword": ".*\\.aaa"
}
}
]
}
},
"aggs": {
"finalCount": { -->total count of docs matching regex
"value_count": {
"field": "id"
}
}
}
},
"mybucket_selector": { ---> include buckets where totalcount==finalcount
"bucket_selector": {
"buckets_path": {
"FinalCount": "filter_agg>finalCount",
"TotalCount": "totalCount"
},
"script": "params.FinalCount==params.TotalCount"
}
}
}
}
}
}
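The logic of that bucket selector (total count equals matching count, i.e. every name under an id matches) can be checked with a few lines of Python on the sample records. This is an emulation sketch, not Elasticsearch code. `re.fullmatch` is used because Elasticsearch regexp patterns are anchored to the whole term, and the dot before aaa is escaped so that it matches only a literal dot.

```python
import re
from collections import defaultdict

# The four sample records from the question.
records = [
    {"id": 1, "name": "bla1_1.aaa"},
    {"id": 1, "name": "bla1_2.bbb"},
    {"id": 2, "name": "bla2_1.aaa"},
    {"id": 2, "name": "bla2_2.aaa"},
]

pattern = re.compile(r".*\.aaa")

# Group names by id, like the terms aggregation on "id".
names_by_id = defaultdict(list)
for record in records:
    names_by_id[record["id"]].append(record["name"])

# Keep an id only if all of its names match.
ids = sorted(
    i for i, names in names_by_id.items()
    if all(pattern.fullmatch(name) for name in names)
)
print(ids)  # [2]
```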

Resources