How to get a distinct list of document field values using NEST? - elasticsearch

I have just started with Elasticsearch and am using the NEST API for my .NET application. I have an index with some records inserted. I am now trying to get a distinct list of values for a document field. I have this working in Postman, but I do not know how to port the JSON aggregation body to a NEST call. Here is the request I am trying to port to the NEST C# API:
{
"size": 0,
"aggs": {
"hosts": {
"terms": {
"field": "host"
}
}
}
}
Here is the result, which leads to my next question: how would I parse the result or map it to a POCO? I am only interested in the distinct list of values for one field, in this case 'host'. I really just want an enumerable of strings back; I do not care about the counts at this point.
{
"took": 0,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 3,
"relation": "eq"
},
"max_score": null,
"hits": []
},
"aggregations": {
"hosts": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "hoyt",
"doc_count": 3
}
]
}
}
}

I was able to get the results I am after with the following code:
var result = await client.SearchAsync<SyslogEntryIndex>(s => s.Size(0).Aggregations(a => a.Terms("hosts", t => t.Field(f => f.Host))));
List<string> hosts = new List<string>();
foreach (BucketAggregate v in result.Aggregations.Values)
{
foreach (KeyedBucket<object> item in v.Items)
{
hosts.Add((string)item.Key);
}
}
return hosts;
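For readers who want to see the shape of the data independently of the client library, here is a minimal sketch in Python that pulls the distinct values out of the raw JSON response shown in the question. The function name `distinct_values` and the trimmed `RESPONSE` are illustrative assumptions, not part of NEST or Elasticsearch.

```python
import json

# Trimmed copy of the aggregation response shown in the question.
RESPONSE = json.loads("""
{
  "hits": {"total": {"value": 3, "relation": "eq"}, "hits": []},
  "aggregations": {
    "hosts": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [{"key": "hoyt", "doc_count": 3}]
    }
  }
}
""")


def distinct_values(response, agg_name):
    """Return the bucket keys of one terms aggregation, ignoring the counts."""
    buckets = response.get("aggregations", {}).get(agg_name, {}).get("buckets", [])
    return [bucket["key"] for bucket in buckets]


hosts = distinct_values(RESPONSE, "hosts")
print(hosts)  # ['hoyt']
```

Keep in mind that a terms aggregation returns only the top buckets (10 by default), so the request needs a larger "size" if the field can have more distinct values than that.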

Related

How to get inner hits field values in the Nest or Elastic.Net library? Alternatively, how to specify the output type in the Nest or Elastic.Net library?

I am new to Elasticsearch and I am having trouble with the Nest/Elastic.Net library.
I would like to retrieve not the entire document but just part of it. I am able to do it in Postman, but I cannot do it via the Elastic.Net or Nest library.
The document structure looks like the following:
{
"Doc_id": "id_for_cross_refference_with_othersystem",
"Ocr": [
{
"word": "example_word1",
"box": [],
"cord": "some_number"
},
{
"word": "example_word2",
"box": [],
"cord": "some_number2"
}
]
}
The document has a huge number of properties, but I am interested only in Doc_id, ocr.word, ocr.box and ocr.cord.
The following Postman request fully satisfies my needs:
{
"query": {
"bool": {
"must": [
{
"match": {
"doc_id": "2a558865-7dc2-4e4d-ad02-3f683159984e"
}
},
{
"nested": {
"path": "ocr",
"query": {
"match": {
"ocr.word": "signing"
}
},
"inner_hits": {
"_source": {
"includes":[
"ocr.word",
"ocr.box",
"ocr.conf"
]
}
}
}
}
]
}
},
"_source":"false"
}
The result of that request is the following:
{
"took": 9,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 3,
"relation": "eq"
},
"max_score": 18.99095,
"hits": [
{
"_index": "irrelevant",
"_type": "irrelevant",
"_id": "irrelevant",
"_score": 18.99095,
"_source": {},
"inner_hits": {
"ocr": {
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 7.9260864,
"hits": [
{
"_index": "irrelevant",
"_type": "irrelevant",
"_id": "irrelevant",
"_nested": {
"field": "ocr",
"offset": 11
},
"_score": 7.9260864,
"_source": {
"box": [
],
"conf": "96.452858",
"word": "signing"
}
}
]
}
}
}
},
{
"_index": "there_rest _of_object_is_ommited",
},
{
"_index": "there_rest _of_object_is_ommited",
}
]
}
}
However, when I try to convert that request to the Nest Query DSL I am not able to achieve the same result.
When I try to use the NEST library I don’t see any way to provide an output result model/type. It looks like the document type has to match the output type, which is not the case for me.
The query that I am using:
var searchResponse = client2.Search<Model>(s => s
.Query(q1 => q1.Bool(b1 => b1.Must(s1 =>
s1.Match(m => m.Field(f => f.doc_id).Query("2a558865-7dc2-4e4d-ad02-3f683159984e")),
s2 => s2.Nested(n => n.Path("ocr").Query(q2 => q2.Bool(b => b.Must(m => m.Match(m => m.Field(f => f.ocr.First().word).Query("signing")))))
.InnerHits(ih => ih.Source(s => s.Includes(i => i.Field(f => f.ocr.First().word).Field(f => f.ocr.First().conf))))
)
)))
.Source(false)
);
Because the Model type was created for the document and doesn’t match the output type, I am getting [null, null, null] as the output.
ISearchResponse has properties such as Hits, but when I look into it I cannot see the values of the fields.
I tried using the low-level client (Elastic.Net) and providing the JSON request as a string, but it looks like there is no way of specifying the output type there either. When I ran my code with the low-level library it returned 3 objects of class Model with empty fields.
My questions are:
Is it possible to specify an output type different from the document type for the Nest query DSL or the Elastic.Net library?
Is it possible to get the values of the fields that I specified in the request for inner hits with the help of the Nest or Elastic.Net libraries?
How would you solve such a problem? We have huge documents and we don’t want to pass unnecessary information back and forth. The inner hits approach looks like a neat solution for us, but it doesn’t look like it works with the recommended libraries, unless I am making some silly mistake.
NOTE: I can achieve the desired result by simply using HttpClient and manually doing what I need, but I hope to leverage a library that is written for this purpose (Nest or Elastic.Net).
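To make the target of the mapping concrete, here is a small Python sketch that walks the raw response shown above and collects the inner hit values. It does not use Nest or Elastic.Net; it only illustrates the path (hits.hits[].inner_hits.ocr.hits.hits[]._source) that any client-side mapping would have to follow. The function name `inner_hit_sources` is an illustrative assumption, and the response is trimmed to the one hit the question shows in full.

```python
import json

# Trimmed copy of the response shown above; only the first hit is kept.
RESPONSE = json.loads("""
{
  "hits": {
    "hits": [
      {
        "_id": "irrelevant",
        "_source": {},
        "inner_hits": {
          "ocr": {
            "hits": {
              "hits": [
                {
                  "_nested": {"field": "ocr", "offset": 11},
                  "_source": {"box": [], "conf": "96.452858", "word": "signing"}
                }
              ]
            }
          }
        }
      }
    ]
  }
}
""")


def inner_hit_sources(response, inner_name):
    """Yield (parent _id, nested offset, inner _source) for every inner hit."""
    for hit in response["hits"]["hits"]:
        inner = hit.get("inner_hits", {}).get(inner_name, {})
        for inner_hit in inner.get("hits", {}).get("hits", []):
            offset = inner_hit.get("_nested", {}).get("offset")
            yield hit.get("_id"), offset, inner_hit.get("_source", {})


rows = list(inner_hit_sources(RESPONSE, "ocr"))
print(rows)
```

The point of the sketch is that the requested fields live under each hit's inner_hits entry rather than under the hit's own _source, which is empty here.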

No results from search when passing more than one parameter in user metadata

I want to apply document level security in Elasticsearch, but once I provide more than one value in the user metadata I get no matches.
I am creating a role and a user in Elasticsearch and passing values from the user metadata into the role query, on the basis of which the search results should be filtered. It works fine if I give one value.
For creating the role:
PUT _xpack/security/role/my_policy
{
"indices": [{
"names": ["my_index"],
"privileges": ["read"],
"query": {
"template": {
"source": "{\"bool\": {\"filter\": [{\"terms_set\": {\"country_name\": {\"terms\": {{#toJson}}_user.metadata.country_name{{/toJson}},\"minimum_should_match_script\":{\"source\":\"params.num_terms\"}}}}]}}"
}
}
}]
}
And for user:
PUT _xpack/security/user/jack_black
{
"username": "jack_black",
"password":"testtest",
"roles": ["my_policy"],
"full_name": "Jack Black",
"email": "jb#tenaciousd.com",
"metadata": {
"country_name": ["india" , "japan"]
}
}
I expect the output to be results for india and japan only. If the user searches for anything else they should get no results.
However, I do not see any results at all:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 0,
"max_score": null,
"hits": []
}
}
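One possible reading of the role query, sketched in Python under stated assumptions: in a terms_set query, params.num_terms is the number of terms supplied, so a minimum_should_match_script that returns it requires a document to match every term. The sketch below mimics that rule on hypothetical documents that each hold a single country_name value (the question does not show the indexed data, so this is an assumption). The helper `terms_set_matches` is illustrative, not an Elasticsearch API.

```python
def terms_set_matches(doc_values, terms, minimum_should_match):
    """Mimic terms_set: count how many of the query terms the document contains."""
    matched = sum(1 for term in set(terms) if term in doc_values)
    return matched >= minimum_should_match


user_terms = ["india", "japan"]  # _user.metadata.country_name from the question
num_terms = len(user_terms)      # what params.num_terms evaluates to

# Hypothetical documents, each with a single country_name value (an assumption).
docs = [
    {"country_name": ["india"]},
    {"country_name": ["japan"]},
    {"country_name": ["france"]},
]

# Requiring num_terms matches: a document must contain both "india" and "japan".
strict = [d for d in docs if terms_set_matches(d["country_name"], user_terms, num_terms)]

# Requiring a single match: a document may contain either value.
any_of = [d for d in docs if terms_set_matches(d["country_name"], user_terms, 1)]

print(strict)
print(any_of)
```

Under these assumptions the strict variant selects nothing once two values are supplied, while it still works with one value because num_terms is then 1, which is consistent with the behaviour described in the question.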

Possible to get the size_in_bytes for records matching a specific query?

The documentation on the stats api indicates that we can do the following:
http://es.cluster.ip.addr:9200/indexname/_stats
Which results in output like:
{
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"_all": {
"primaries": {
"docs": {
"count": 32930,
"deleted": 0
},
"store": {
"size_in_bytes": 3197332,
"throttle_time_in_millis": 0
},
// ... etc
}
}
}
My question is: is there a way to obtain the size for a specific set of records, such as those matched by a search query like the following?
http://es.cluster.ip.addr:9200/indexname/type/_search?q=identifier:123
So essentially, the size_in_bytes for all records matching the identifier 123?

Convert any Elasticsearch response to simple field value format

In Elasticsearch, when doing a simple query like:
GET miindex-*/mytype/_search
{
"query": {
"query_string": {
"analyze_wildcard": true,
"query": "*"
}
}
}
It returns a format like:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 28,
"max_score": 1,
"hits": [
...
So I parse response.hits.hits to get the actual records.
However, if you run another type of query, e.g. an aggregation, the response is totally different:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 28,
"max_score": 0,
"hits": []
},
"aggregations": {
"myfield": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
...
and I actually need to look in another JSON property, response.aggregations.myfield.buckets, which gets even more complicated if you have more than one aggregation.
So, my question is very simple: isn't there a way to get Elasticsearch to always respond with just the fields I want, as in SQL?
E.g.
SELECT author, bookid FROM books
Would return:
{"author":"rogers", "bookid":099991}
{"author":"peter", "bookid":099992}
SELECT COUNT(author) As count_author, author, count(bookid) As count_bookid, bookid FROM books GROUP BY author, bookid
Would return:
{"count_author":4, "author":"rogers", "count_bookid":9, "bookid":099991}
{"count_author":8, "author":"peter", "count_bookid":9, "bookid":099992}
Is there a way to show only the fields I want and nothing else (without having to look within nested JSON objects and all that)? I want this because I'm doing many reports and I want a simple function that parses each response easily in a uniform way.
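As an illustration of the kind of uniform parsing function described above, here is a minimal Python sketch that flattens both response shapes into a list of plain rows on the client side. The function name `to_rows`, the `count_` prefix, and the sample data are hypothetical. It assumes single-level, terms-style aggregations whose buckets are a list, and it does not handle nested or metric aggregations.

```python
def to_rows(response):
    """Flatten a search response into a list of plain dicts.

    Each hit becomes its _source; each bucket of a terms-style aggregation
    becomes {<agg name>: key, "count_<agg name>": doc_count}.
    """
    rows = [hit.get("_source", {}) for hit in response.get("hits", {}).get("hits", [])]
    for name, agg in response.get("aggregations", {}).items():
        for bucket in agg.get("buckets", []):
            rows.append({name: bucket["key"], "count_" + name: bucket["doc_count"]})
    return rows


# Hypothetical sample responses in the two shapes shown in the question.
# bookid is a string because a number with a leading zero is not valid JSON.
search_response = {
    "hits": {"total": 2, "hits": [
        {"_source": {"author": "rogers", "bookid": "099991"}},
        {"_source": {"author": "peter", "bookid": "099992"}},
    ]}
}
agg_response = {
    "hits": {"total": 28, "hits": []},
    "aggregations": {"author": {"buckets": [
        {"key": "rogers", "doc_count": 4},
        {"key": "peter", "doc_count": 8},
    ]}},
}

print(to_rows(search_response))
print(to_rows(agg_response))
```

Both calls return the same kind of structure, a list of flat dicts, so a report can iterate over the rows without caring which part of the response they came from.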

Elasticsearch NEST TopHits aggregation

I've been struggling with a problem for a while now, so I thought I would run this by Stack Overflow.
My document type has a title, a language field (used to filter) and a grouping id field (I'm leaving out all the other fields to keep this to the point).
When I search for documents I want to find all documents containing the text in the title. I only want one document for each unique grouping id.
I've been looking at the top hits aggregation, and from what I can see it should be able to solve my problem.
When running this query against my index:
{
"query": {
"match": {
"title": "dingo"
}
},
"aggs": {
"top-tags": {
"terms": {
"field": "groupId",
"size": 1000000
},
"aggs": {
"top_tag_hits": {
"top_hits": {
"_source": {
"include": [
"*"
]
},
"size": 1
}
}
}
}
}
}
I get the following response (All results are in the same language):
{
"took": 9,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 0,
"hits": []
},
"aggregations": {
"top-tags": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [{
"key": "3044BC9E7C29450AAB2E4B6C9B35AAE2",
"doc_count": 2,
"top_tag_hits": {
"hits": {
"total": 2,
"max_score": 1.4983996,
"hits": [{
"_index": "elasticsearch",
"_type": "productdocument",
"_id": "FB15279FB18E4B34AD66ACAF69B96E9E",
"_score": 1.4983996,
"_source": {
"groupId": "3044BC9E7C29450AAB2E4B6C9B35AAE2",
"title": "wombat, dingo and zetapunga actionfigures",
}
}]
}
}
},
{
"key": "F11799ABD0C14B98ADF2554C84FF0DA0",
"doc_count": 1,
"top_tag_hits": {
"hits": {
"total": 1,
"max_score": 1.30684,
"hits": [{
"_index": "elasticsearch",
"_type": "productdocument",
"_id": "42562A25E4434A0091DE0C79A3E7F3F4",
"_score": 1.30684,
"_source": {
"groupId": "F11799ABD0C14B98ADF2554C84FF0DA0",
"title": "awesome dingo raptor"
}
}]
}
}
}]
}
}
}
This is exactly what I expected (two hits in one bucket, but only one document retrieved for that bucket). However, when I try this in NEST I can't seem to retrieve all of the documents.
My query looks like this:
result = _elasticClient.Search<T>(s => s
.From(skip)
.Filter(fd => fd.Term(f => f.Language, language))
.Size(pageSize)
.SearchType(SearchType.Count)
.Query(
q => q.Wildcard(f => f.Title, query, 2.0)
|| q.Wildcard(f => f.Description, query)
)
.Aggregations(agd =>
agd.Terms("groupId", tagd => tagd
.Field("groupId")
.Size(100000) //We sadly need all products
)
.TopHits("top_tag_hits", thagd => thagd
.Size(1)
.Source(ssd => ssd.Include("*")))
));
var topHits = result.Aggs.TopHits("top_tag_hits");
var documents = topHits.Documents<ProductDocument>(); //contains only one document (I would expect it to contain two, one for each bucket)
Inspecting the aggregations in the debugger reveals there is a "groupId" aggregation with 2 buckets, matching what I see in my "raw" query against the index, just without any apparent way to retrieve the documents.
So my question is: how do I retrieve the top hit for each bucket? Or am I doing this completely wrong? Is there some other way to achieve what I am trying to do?
EDIT
After the help I received, I was able to retrieve my results with the following:
result = _elasticClient.Search<T>(s => s
.From(skip)
.Filter(fd => fd.Term(f => f.Language, language))
.Size(pageSize)
.SearchType(SearchType.Count)
.Query(
q => q.Wildcard(f => f.Title, query, 2.0)
|| q.Wildcard(f => f.Description, query)
)
.Aggregations(agd =>
agd.Terms("groupId", tagd => tagd
.Field("groupId")
.Size(0)
.Aggregations(tagdaggs =>
tagdaggs.TopHits("top_tag_hits", thagd => thagd
.Size(1)))
)
)
);
var groupIdAggregation = result.Aggs.Terms("groupId");
var topHits =
groupIdAggregation.Items.Select(key => key.TopHits("top_tag_hits"))
.SelectMany(topHitMetric => topHitMetric.Documents<ProductDocument>()).ToList();

Your NEST query tries to run the Terms aggregation and TopHits side by side, while your original query runs Terms first and then calls TopHits for each bucket.
You simply have to move your TopHits agg into Terms in your NEST query to make it work.
This should fix it:
.Aggregations(agd =>
agd.Terms("groupId", tagd => tagd
.Field("groupId")
.Size(0)
.Aggregations(tagdaggs =>
tagdaggs.TopHits("top_tag_hits", thagd => thagd
.Size(1)))
)
)
By the way, you don't have to use Include("*") to include all fields; just remove this option. Also, specifying .Size(0) should bring back ALL possible terms for you.
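The nesting described above is easier to see on the raw response. Here is a small Python sketch, independent of NEST, that collects the top hit documents from each terms bucket of the response shown in the question. The function name `top_hit_sources` is an illustrative assumption, and the response is trimmed to the fields the sketch reads.

```python
import json

# Trimmed copy of the aggregation response shown in the question.
RESPONSE = json.loads("""
{
  "aggregations": {
    "top-tags": {
      "buckets": [
        {
          "key": "3044BC9E7C29450AAB2E4B6C9B35AAE2",
          "doc_count": 2,
          "top_tag_hits": {"hits": {"total": 2, "hits": [
            {"_source": {"groupId": "3044BC9E7C29450AAB2E4B6C9B35AAE2",
                         "title": "wombat, dingo and zetapunga actionfigures"}}
          ]}}
        },
        {
          "key": "F11799ABD0C14B98ADF2554C84FF0DA0",
          "doc_count": 1,
          "top_tag_hits": {"hits": {"total": 1, "hits": [
            {"_source": {"groupId": "F11799ABD0C14B98ADF2554C84FF0DA0",
                         "title": "awesome dingo raptor"}}
          ]}}
        }
      ]
    }
  }
}
""")


def top_hit_sources(response, terms_name, top_hits_name):
    """Collect the _source of every top hit, bucket by bucket."""
    docs = []
    for bucket in response["aggregations"][terms_name]["buckets"]:
        docs.extend(hit["_source"] for hit in bucket[top_hits_name]["hits"]["hits"])
    return docs


titles = [doc["title"] for doc in top_hit_sources(RESPONSE, "top-tags", "top_tag_hits")]
print(titles)
```

The loop over buckets followed by the loop over each bucket's hits corresponds to the Select and SelectMany calls in the C# code from the EDIT: the top hits sit inside each bucket, not next to the terms aggregation.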
