Elastic search Nest TopHits aggregation - elasticsearch

I've been struggling with a problem for a while now, so i thought i would swing this by stackoverflow.
My document type has a title, a language field (used to filter) and a grouping id field (im leaving out all the other fields to keep this to the point)
When i search for documents i want to find all documents containing the text in the title. I only want one document for each unique grouping id.
I've been looking at tophits aggregation, and from what i can see it should be able to solve my problem.
When running this query against my index:
{
"query": {
"match": {
"title": "dingo"
}
},
"aggs": {
"top-tags": {
"terms": {
"field": "groupId",
"size": 1000000
},
"aggs": {
"top_tag_hits": {
"top_hits": {
"_source": {
"include": [
"*"
]
},
"size": 1
}
}
}
}
}
}
I get the following response (All results are in the same language):
{
"took": 9,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 0,
"hits": []
},
"aggregations": {
"top-tags": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [{
"key": "3044BC9E7C29450AAB2E4B6C9B35AAE2",
"doc_count": 2,
"top_tag_hits": {
"hits": {
"total": 2,
"max_score": 1.4983996,
"hits": [{
"_index": "elasticsearch",
"_type": "productdocument",
"_id": "FB15279FB18E4B34AD66ACAF69B96E9E",
"_score": 1.4983996,
"_source": {
"groupId": "3044BC9E7C29450AAB2E4B6C9B35AAE2",
"title": "wombat, dingo and zetapunga actionfigures",
}
}]
}
}
},
{
"key": "F11799ABD0C14B98ADF2554C84FF0DA0",
"doc_count": 1,
"top_tag_hits": {
"hits": {
"total": 1,
"max_score": 1.30684,
"hits": [{
"_index": "elasticsearch",
"_type": "productdocument",
"_id": "42562A25E4434A0091DE0C79A3E7F3F4",
"_score": 1.30684,
"_source": {
"groupId": "F11799ABD0C14B98ADF2554C84FF0DA0",
"title": "awesome dingo raptor"
}
}]
}
}
}]
}
}
}
This is exactly what i expected (two hits in one bucket, but only one document retrieved for that bucket). However when i try this in NEST i can't seem to retrieve all of the documents.
My query looks like this:
result = _elasticClient.Search<T>(s => s
.From(skip)
.Filter(fd => fd.Term(f => f.Language, language))
.Size(pageSize)
.SearchType(SearchType.Count)
.Query(
q => q.Wildcard(f => f.Title, query, 2.0)
|| q.Wildcard(f => f.Description, query)
)
.Aggregations(agd =>
agd.Terms("groupId", tagd => tagd
.Field("groupId")
.Size(100000) //We sadly need all products
)
.TopHits("top_tag_hits", thagd => thagd
.Size(1)
.Source(ssd => ssd.Include("*")))
));
var topHits = result.Aggs.TopHits("top_tag_hits");
var documents = topHits.Documents<ProductDocument>(); //contains only one document (I would expect it to contain two, one for each bucket)
Inspecting the aggregations in the debugger reveals there is a "groupId" aggregation with 2 buckets (and matching what i see in my "raw" query against the index. Just without any apparent way to retrieve the documents)
So my question is. How do i retrieve the top hit for each bucket? Or am i doing this completely wrong? Is there some other way to achieve what i am trying to do?
EDIT
After the help i received, i was able to retrieve my results with the following:
result = _elasticClient.Search<T>(s => s
.From(skip)
.Filter(fd => fd.Term(f => f.Language, language))
.Size(pageSize)
.SearchType(SearchType.Count)
.Query(
q => q.Wildcard(f => f.Title, query, 2.0)
|| q.Wildcard(f => f.Description, query)
)
.Aggregations(agd =>
agd.Terms("groupId", tagd => tagd
.Field("groupId")
.Size(0)
.Aggregations(tagdaggs =>
tagdaggs.TopHits("top_tag_hits", thagd => thagd
.Size(1)))
)
)
);
var groupIdAggregation = result.Aggs.Terms("groupId");
var topHits =
groupIdAggregation.Items.Select(key => key.TopHits("top_tag_hits"))
.SelectMany(topHitMetric => topHitMetric.Documents<ProductDocument>()).ToList();

Your NEST query tries to run both Terms aggregation and TopHits side by side, while your original query runs Terms first and then for each bucket, you're calling TopHits.
You simply have to move your TopHits agg into Terms in your NEST query to make it work fine.
This should fix it:
.Aggregations(agd =>
agd.Terms("groupId", tagd => tagd
.Field("groupId")
.Size(0)
.Aggregations(tagdaggs =>
tagdaggs.TopHits("top_tag_hits", thagd => thagd
.Size(1)))
)
)
By the way, you don't have to use Include("*") to include all fields. Just remove this option, also specifying .Size(0) should bring back ALL possible terms for you.

Related

How to get inner hits field values in Nest or Elastic.Net library ? Alterantivly how to specify output type in Nest or Elastic.Net library?

I am new to elasticsearch and I am having troubles with the Nest/Elastic.Net library.
I would like to retrieve not the entire document but just part of it. I am able to do it in Postman but I cannot do it via Elastic.Net library or Nest library.
Document structure looks like following
{
“Doc_id”: “id_for_cross_refference_with_othersystem”
“Ocr”:[
{
“word”: “example_word1”,
“box”: [],
“cord”: “some_number”,
},
{
“word”: “example_word2”,
“box”: [],
“cord”: “some_number2”,
}
]
}
The document has a huge amount of properties but I am interested only in Doc_id , ocr.word, ocr.box and ocr.cord.
The following postman request fully satisfies my needs :
{
"query": {
"bool": {
"must": [
{
"match": {
"doc_id": "2a558865-7dc2-4e4d-ad02-3f683159984e"
}
},
{
"nested": {
"path": "ocr",
"query": {
"match": {
"ocr.word": "signing"
}
},
"inner_hits": {
"_source": {
"includes":[
"ocr.word",
"ocr.box",
"ocr.conf"
]
}
}
}
}
]
}
},
"_source":"false"
}
Result of that request is following :
{
"took": 9,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 3,
"relation": "eq"
},
"max_score": 18.99095,
"hits": [
{
"_index": "irrelevant",
"_type": "irrelevant",
"_id": "irrelevant",
"_score": 18.99095,
"_source": {},
"inner_hits": {
"ocr": {
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 7.9260864,
"hits": [
{
"_index": "irrelevant",
"_type": "irrelevant",
"_id": "irrelevant",
"_nested": {
"field": "ocr",
"offset": 11
},
"_score": 7.9260864,
"_source": {
"box": [
],
"conf": "96.452858",
"word": "signing"
}
}
]
}
}
}
},
{
"_index": "there_rest _of_object_is_ommited",
},
{
"_index": "there_rest _of_object_is_ommited",
}
]
}
}
However when I try to convert that request to Nest Query DSL I am not able to achieve the same result.
When I try to use the NEST library I don’t see any way to provide output result model/type. It looks like the Type of Document should match the output type which is not my case.
Query that I am using :
var searchResponse = client2.Search<Model>(s => s
.Query(q1 => q1.Bool(b1 => b1.Must(s1 =>
s1.Match(m => m.Field(f => f.doc_id).Query("2a558865-7dc2-4e4d-ad02-3f683159984e")),
s2 => s2.Nested(n => n.Path("ocr").Query(q2 => q2.Bool(b => b.Must(m => m.Match(m => m.Field(f => f.ocr.First().word).Query("signing")))))
.InnerHits(ih => ih.Source(s => s.Includes(i => i.Field(f => f.ocr.First().word).Field(f => f.ocr.First().conf))))
)
)))
.Source(false)
);
Due to the fact that the Model type is created for a document and it doesn’t match the output type I am getting [null, null, null] as the output .
There is property such properties as Hits in ISearchResponse? But when I look into it I cannot see values of fields.
I tried using a low level client (Elastic.Net) and providing json request as a string. But It looks like there is not way of specifying the output type either. When I ran my code with the low level library it returns me 3 object of class Model with empty fields.
My questions are :
Is it possible to specify output type different from document type for Nest query DSL or Elatic.Net library ?
Is it possible to get values of the fields that I specified in request for inner hits with help of Nest or Elastic.Net libraries?
How would you solve such problem ? I mean we have huge documents and we don’t want to pass unnecessary information back and forth. The inner hits approach looks like a neat solution for us but it doesn’t look like it works with the recommended libraries Unless I am doing some silly mistake.
NOTE: I can achieve desired result with simple use of HTTPClient and manually doing what I need , but I hope to leverage library that is written for this purpose(Nest or Elastic.Net).

How get a distinct list of document fields using NEST?

I have just started with Elasticsearch and am using the NEST API for my .Net application. I have an index and some records inserted. I am now trying to get a distinct list of document field values. I have this working in Postman. I do not know how to port the JSON aggregation body to a NEST call. Here is the call I am trying to port to the NEST C# API:
{
"size": 0,
"aggs": {
"hosts": {
"terms": {
"field": "host"
}
}
}
Here is the result which is my next question. How would I parse or assign a POCO to the result? I am only interested in the distinct list of the field value in this case 'host'. I really just want an enumerable of strings back. I do not care about the count at this point.
{
"took": 0,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 3,
"relation": "eq"
},
"max_score": null,
"hits": []
},
"aggregations": {
"hosts": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "hoyt",
"doc_count": 3
}
]
}
}
}
I was able to get the results I am after with the following code:
var result = await client.SearchAsync<SyslogEntryIndex>(s => s.Size(0).Aggregations(a => a.Terms("hosts", t => t.Field(f => f.Host))));
List<string> hosts = new List<string>();
foreach (BucketAggregate v in result.Aggregations.Values)
{
foreach (KeyedBucket<object> item in v.Items)
{
hosts.Add((string)item.Key);
}
}
return hosts;

Elasticsearch - Range query work wrong

I used kibana DEV Tools to query some range data,but there have 2 hits is out of my expectation,why it happens?
image of the range query
the query:
{
"query" : {
"constant_score" : {
"filter" : {
"range" : {
"rss" : {
"gte": 3000000
}
}
}
}
}
}
the result:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 69,
"successful": 69,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 1,
"hits": [
{
"_index": "BBQ",
"_type": "BBQ",
"_id": "AWGJaCYkk-tGbWgj2e6R",
"_score": 1,
"_source": {
"message": [
"nodeProcessInfo"
],
"#timestamp": "2018-02-12T09:45:59.525Z",
"rss": "92636",
"#version": "1",
"host": "192.168.213.96"
}
},
{
"_index": "BBQ",
"_type": "BBQ",
"_id": "AWGJaJxzk-tGbWgj2e-V",
"_score": 1,
"_source": {
"message": [
"nodeProcessInfo"
],
"#timestamp": "2018-02-12T09:46:29.680Z",
"rss": "85272",
"#version": "1",
"host": "192.168.213.96"
}
}
]
}
}
The result of range query is not in my expectation, why gte => 3000000 but rss = 92636 appeared?
======================edit at 2018.2.13=========(1)
the log like this:
"nodeProcessInfo|auth-server-1|auth|9618|1.9|1.2|98060|2018-2-12 6:33:43 PM|"
the filter like this:
filter {
if "nodeProcessInfo" in [message] {
mutate {
split => ["message", "|"]
add_field => {
"serverId" => "%{[message[1]]}"
}
add_field => {
"serverType" => "%{[message[2]]}"
}
add_field => {
"pid" => "%{[message[3]]}"
}
add_field => {
"cpuAvg" => "%{[message[4]]}"
}
add_field => {
"memAvg" => "%{[message[5]]}"
}
add_field => {
"rss" => "%{[message[6]]}"
}
add_field => {
"time" => "%{[message[7]]}"
}
convert => ["rss", "integer"] # I try convert rss to int, but failed
add_tag => "nodeProcessInfo"
}
}
}
======================edit at 2018.2.13=========(2)
I let the convert code in a new mutate, and it worked to make "rss" into int type,but the result of range query also wrong,the change code like this:
if "nodeProcessInfo" in [message] {
mutate {
split => ["message", "|"]
...
...
add_field => {
"rss" => "%{[message[6]]}"
}
}
mutate {
convert => ["rss", "integer"] # add a new mutate here
}
}
======================edit at 2018.2.13=========(3)
At last I found the reason why rss'type is converted to int but range query also wrong:
"You can't change existing mapping type, you need to create a new index with the correct mapping and index the data again."
so I create a new field name to instead of rss and the result of range query is right now.
Can you share the mapping of the index.
I thing the problem is as i can see in the search results which you have shared , the type of the rss field is text or string.
If it is so then the range query you are using is treating them as string characters and giving you results according to that.
And what you are trying to use is number ranges which will work if you index data with type of rss field as long and then fire the same query.
You would then get the desired reuslts

Wrong data shown when searched

i have a dataset of more than a million rows. I have integrated elasticsearch with Mysql using logstash.
When i type the following URL to fetch in postman,
http://localhost:9200/persondetails/Document/_search?q=*
i get the following:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 1,
"hits": [
{
"_index": "persondetails",
"_type": "Document",
"_id": "%{idDocument}",
"_score": 1,
"_source": {
"iddocument": 514697,
"#timestamp": "2017-08-31T05:18:46.916Z",
"author": "vaibhav",
"expiry_date": null,
"#version": "1",
"description": "ly that",
"creation_date": null,
"type": 1
}
},
{
"_index": "persondetails",
"_type": "Document_count",
"_id": "AV4o0J3OJ5ftvuhV7i0H",
"_score": 1,
"_source": {
"query": {
"term": {
"author": "rishav"
}
}
}
}
]
}
}
it is wrong as the number of rows in my table is more than 1 million and this shows that total is only 2. I am unable to find what is the mistake here.
when i type http://localhost:9200/_cat/indices?v
It shows this
health:yellow
status:open
index:persondetails
uuid:4FiGngZcQfS0Xvu6IeHIfg
pri:5
rep : 1
docs.count : 2
docs.deleted :1054
store.size : 125.4kb
pri.store.size : 125.4kb
This is my logstash.conf file
input {
jdbc {
jdbc_connection_string => "jdbc:mysql://127.0.0.1:3306/persondetails"
jdbc_user => "root"
jdbc_password => ""
schedule => "* * * * *"
jdbc_validate_connection => true
jdbc_driver_library => "/usr/local/Cellar/logstash/5.5.2/mysql-connector-java-3.1.14/mysql-connector-java-3.1.14-bin.jar"
jdbc_driver_class => "com.mysql.jdbc.Driver"
statement => "SELECT * FROM Document"
type => "persondetails"
}
}
output {
elasticsearch {
#protocol=>http
index =>"persondetails"
document_type => "Document"
document_id => "%{idDocument}"
hosts => ["http://localhost:9200"]
stdout{ codec => rubydebug}
}
}
From your result, it looks like there is an issue with your logstash configuration which is causing your document to be overwritten because the document_id is not getting generated, and effectively there is only one document in your index with document Id as "%{idDocument}"
See the following _source snippet from the result to the search query you provided:
"_source": {
"iddocument": 514697,
"#timestamp": "2017-08-31T05:18:46.916Z",
"author": "vaibhav",
"expiry_date": null,
"#version": "1",
"description": "ly that",
"creation_date": null,
"type": 1
}
Even looking at the small size of the index, it doesn't look like there are more documents. You should look at whether your jdbc input is providing the "idDocument" field.

using elasticsearch filter in logstash pipeline

I'm using the elasticsearch filter in my logstash pipeline. I correctly find the result using :
filter{
if [class] == "DPAPIINTERNAL" {
elasticsearch {
hosts => "10.1.10.16"
index => "dp_audit-2017.02.16"
query_template => "/home/vittorio/Documents/elastic-queries/matching-requestaw.json"
}
}
}
as you can see, Im using "query_template" which is :
{
"query": {
"query_string": {
"query": "class:DPAPI AND request.aw:%{[aw]}"
}
},
"_source": ["end_point", "vittorio"]
}
that tells elastichsearch to look up the log with that specific class that match "aw" with the DPAPIINTERNAL log.
Perfect! but now that i found the result, i want to add some field from it and attach them to my DPAPIINTERNAL log, for instance, i want to take "end_point" and add it in the new key "vittorio" inside my log.
This is not happening and I don't understand why.
here is the log that i'm looking at using the query:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "dp_audit-2017.02.16",
"_type": "logs",
"_id": "AVpHoPHPuEPlW12Qu",
"_score": 1,
"_source": {
"svc": "dp-1.1",
"request": {
"method": "POST|PATCH|DELETE",
"aw": "prova",
"end_point": "/bank/6311",
"app_instance": "7D1-D233-87E1-913"
},
"path": "/home/vittorio/Documents/dpapi1.json",
"#timestamp": "2017-02-16T15:53:33.214Z",
"#version": "1",
"host": "Vito",
"event": "bank.add",
"class": "DPAPI",
"ts": "2017-01-16T19:20:30.125+01:00"
}
}
]
}
}
Your need to specify the fields parameter in your elasticsearch filter, like this:
elasticsearch {
hosts => "10.1.10.16"
index => "dp_audit-2017.02.16"
query_template => "/home/vittorio/Documents/elastic-queries/matching-requestaw.json"
fields => { "[request][end_point]" => "vittorio" }
}
Note that since end_point is a nested field, you need to modify the _source in your query template like this:
"_source": ["request.end_point"]
the problem is simply that you don't have to specify the "new" field using the query_template.
"_source": ["request"] # here you specify the field you want from the query result.
and then
filter{
if [class] == "DPAPIINTERNAL" {
elasticsearch {
hosts => "10.1.10.16"
index => "dp_audit-2017.02.16"
query_template => "/home/vittorio/Documents/elastic-queries/matching-requestaw.json"
fields => {"request" => "new_key"} # here you add the fields and will tell elastich filter to put request inside new_key
}
}
}
That worked for me!

Resources