How to get documents that differ by the value of a field - elasticsearch

I'm using ElasticSearch 6.3.
Scenario: tens of thousands of documents have a "324" field, most of them with the value "blabla". A few have "blabla blo" in that field. These come last in the query results if I set size: 10000 (with the default size, they don't appear at all). What I really want is one record per unique value: one with "324": "blabla" and one with "324": "blabla blo".
I'm using a wildcard query and getting all 10000 documents, but I only need those two.
I'm going to populate an HTML select tag with these records, ideally with only the two of them!
Query body:
{
"query": {
"wildcard":{
"324" : {
"value":"*b*"
}
}
},
"size": 10000,
"_source": ["324"]
}
How should I do this? I suppose the idea is similar to finding records whose value in that field isn't a full duplicate of another record's value.
Thank you

That's what aggs are for!
GET index_name/_search
{
"query": {
"wildcard": {
"324": {
"value": "*b*"
}
}
},
"size": 0,
"aggs": {
"324_uniques": {
"terms": {
"field": "324",
"size": 10
}
}
}
}
The field could be 324 or 324.keyword, depending on your mapping (a terms aggregation needs a keyword field, or a text field with fielddata enabled).
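To feed the select tag, only the bucket keys of that aggregation are needed. Here is a minimal Python sketch of building the request body and pulling the distinct values out of the response. The function names, the aggregation name `uniques`, and the sample response are made up for illustration; how you send the request (official client, plain HTTP) is up to you.

```python
def build_unique_values_query(field, pattern, max_values=10):
    """Build a search body that returns only distinct values of `field`."""
    return {
        "query": {"wildcard": {field: {"value": pattern}}},
        "size": 0,  # no document hits, only the aggregation
        "aggs": {
            "uniques": {"terms": {"field": field, "size": max_values}}
        },
    }


def extract_unique_values(response, agg_name="uniques"):
    """Return the bucket keys (the distinct field values) from a response."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return [bucket["key"] for bucket in buckets]


# Hypothetical response, shaped like a terms aggregation result.
sample_response = {
    "aggregations": {
        "uniques": {
            "buckets": [
                {"key": "blabla", "doc_count": 9998},
                {"key": "blabla blo", "doc_count": 2},
            ]
        }
    }
}
options = extract_unique_values(sample_response)
```

Each entry of `options` would then become one option element of the select tag.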

Related

Paginate an aggregation sorted by hits on Elastic index

I have an Elastic index (say file) to which I append a document every time a file is downloaded by a client. Each document is quite basic: it contains a filename field and a date field, when, that records the time of the download.
What I want is the number of times each file has been downloaded in the last 3 months. Thanks to another question, I have a query that returns all the results:
{
"query": {
"range": {
"when": {
"gte": "now-3M"
}
}
},
"aggs": {
"downloads": {
"terms": {
"field": "filename.keyword",
"size": 1000
}
}
},
"size": 0
}
Now I want paginated results. The terms aggregation cannot be paginated, so I use a composite aggregation. Of course, if a better-suited aggregation exists, it can be used here.
So for the moment, I have something like this:
{
"query": {
"range": {
"when": {
"gte": "now-3M"
}
}
},
"aggs": {
"downloads_agg": {
"composite": {
"size": 100,
"sources": [
{
"downloads": {
"terms": {
"field": "filename.keyword"
}
}
}
]
}
}
},
"size": 0
}
This aggregation lets me paginate (thanks to the after_key value in the response), but it is not sorted by the number of downloads; it is sorted by filename.
How can I sort that composite aggregation on the number of documents for each filename in my index?
Thanks.
The composite aggregation doesn't allow sorting by document count.
Excerpt from a discussion on the Elastic forum:
it's designed as a memory-friendly way to paginate over aggregations.
Part of the tradeoff is that you lose things like ordering by doc
count, since that isn't known until after all the docs have been
collected.
I have no experience with Transforms (part of X-Pack and licensed), but you could try them. Apart from that, I don't see a way to get the expected output.
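One workaround that is not part of the forum discussion, and that only makes sense when the number of distinct filenames fits in memory: walk through every page of the composite aggregation on the client, then sort the collected buckets by doc_count yourself. A rough Python sketch follows. The `search` argument is a hypothetical callable that takes a request body and returns the parsed response; it stands in for whatever client call you use.

```python
import copy


def collect_sorted_by_count(search, body, agg_name, source_name):
    """Page through a composite aggregation, then sort buckets by doc_count.

    `search` is a hypothetical callable: request body (dict) in,
    parsed search response (dict) out.
    """
    body = copy.deepcopy(body)  # avoid mutating the caller's request
    composite = body["aggs"][agg_name]["composite"]
    counts = []
    while True:
        result = search(body)["aggregations"][agg_name]
        buckets = result.get("buckets", [])
        if not buckets:
            break
        for bucket in buckets:
            counts.append((bucket["key"][source_name], bucket["doc_count"]))
        after_key = result.get("after_key")
        if after_key is None:
            break
        composite["after"] = after_key  # resume after the last bucket seen
    # Highest count first; ties are broken by name.
    return sorted(counts, key=lambda item: (-item[1], item[0]))
```

With the query from the question, `agg_name` would be "downloads_agg" and `source_name` would be "downloads". Pagination of the sorted list then happens on the client, which gives up the memory-friendliness the composite aggregation was designed for.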

How to limit search results from each index in a multi index search query?

I am using Elasticsearch version 6.3 and I want to run queries across multiple indices. Elasticsearch supports this: I can give multiple indices as comma-separated values in the URL, with one query in the request body, and also give a size parameter to limit the number of search results returned. However, this limits the size of the overall result set and might lead to no results from some indices, so instead I want to fetch the first n results from each index.
I tried the multi search API (_msearch), where it seems I have to give the same query and size for every index. That works, but I am not able to get a single aggregation over the entire result. Is there any way to address both issues?
Solution 1:
You're on the right path with the _msearch query. What I would do is to issue one query per index (no aggregations!) with the size you want for that index, as well as another query just for the aggregations, like this:
{ "index": "index1" }
{ "size": 5, "query": { ... }}
{ "index": "index2" }
{ "size": 5, "query": { ... }}
{ "index": "index3" }
{ "size": 5, "query": { ... }}
{ "index": "index1,index2,index3" }
{ "size": 0, "query": { ... }, "aggs": { ... } }
So the first three queries will return document hits from each of the three indexes and the last query will return the aggregation computed on all indexes, but no documents.
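If the list of indexes varies, the _msearch body can be generated instead of written by hand. Below is a small Python sketch that assembles the newline-delimited body shown above. The function name and the example query and aggregation are placeholders of mine (the original leaves both elided); substitute your own.

```python
import json


def build_msearch_body(indices, query, size_per_index, aggs):
    """Return the newline-delimited JSON body for an _msearch request."""
    lines = []
    for index in indices:
        # One header line plus one search line per index.
        lines.append(json.dumps({"index": index}))
        lines.append(json.dumps({"size": size_per_index, "query": query}))
    # Final search: aggregation over all indexes, no document hits.
    lines.append(json.dumps({"index": ",".join(indices)}))
    lines.append(json.dumps({"size": 0, "query": query, "aggs": aggs}))
    return "\n".join(lines) + "\n"  # the body must end with a newline


# Placeholder query and aggregation, for illustration only.
example_query = {"match_all": {}}
example_aggs = {"per_index": {"terms": {"field": "_index"}}}
msearch_body = build_msearch_body(
    ["index1", "index2", "index3"], example_query, 5, example_aggs
)
```

The responses come back in the same order as the searches, so the last entry of the responses array holds the aggregation.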
Solution 2:
Another way to tackle this, if the size is small, is to send a single query and then aggregate on the index name, retrieving the hits from each index with top_hits, like this:
POST index1,index2,index3/_search
{
"size": 0,
"query": { ... },
"aggs": {
"indexes": {
"terms": {
"field": "_index",
"size": 50
},
"aggs": {
"hits": {
"top_hits": {
"size": 5
}
}
}
}
}
}
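On the client side, the documents for each index then sit inside the top_hits result of the matching bucket. A short Python sketch of unpacking that response follows; the function name and the sample response are invented for illustration, while the aggregation names match the query above.

```python
def hits_by_index(response, terms_agg="indexes", top_hits_agg="hits"):
    """Group the _source of each top hit under the name of its index."""
    grouped = {}
    for bucket in response["aggregations"][terms_agg]["buckets"]:
        grouped[bucket["key"]] = [
            hit["_source"] for hit in bucket[top_hits_agg]["hits"]["hits"]
        ]
    return grouped


# Hypothetical response with one document returned per index.
sample_response = {
    "aggregations": {
        "indexes": {
            "buckets": [
                {
                    "key": "index1",
                    "doc_count": 40,
                    "hits": {"hits": {"hits": [{"_source": {"title": "a"}}]}},
                },
                {
                    "key": "index2",
                    "doc_count": 3,
                    "hits": {"hits": {"hits": [{"_source": {"title": "b"}}]}},
                },
            ]
        }
    }
}
grouped = hits_by_index(sample_response)
```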

How to find top terms with occurrences in Elasticsearch

I have a fairly big dataset in Elasticsearch: 1 index, about 120 million records of one type. I am processing a large number of paragraphs on a given set of topics. The number of topics is limited and associated with a unique ID. Each paragraph has a couple of sentences identified by the sentence_id (unique across all topics). Each sentence has a number of words and each word can occur multiple times. So my mapping looks like the following:
{
"sentence_id": 1200,
"topic_id": 2,
"value": "ground",
"occurrences": 20
}
Now, I want to run a query which answers this:
"Find the top words for a given topic ID sorted by their occurrences."
So for each word in a topic, I have to sum up its occurrences across all the sentences, sort them and return.
I am not able to achieve this. I tried writing a terms aggregation query, but it does not sum the occurrences; it merely returns the count of records in each bucket.
{
"query": {
"term": {
"topic_id": {
"value": 3117
}
}
},
"aggs": {
"total_occurrences": {
"terms": {
"field": "occurrences",
"size": 1000
}
}
}
}
Can someone help me out?
I think you first need to aggregate on the unique value and then sum its occurrences. Assuming your occurrences field is numeric, your query should look something like this; the order clause sorts the buckets by the summed occurrences, so the top words come first:
{
"query": {
"term": {
"topic_id": {
"value": 3117
}
}
},
"aggs": {
"total_occurrences": {
"terms": {
"field": "value",
"size": 1000,
"order": {
"sum_occurrences": "desc"
}
},
"aggs": {
"sum_occurrences": {
"sum": {
"field": "occurrences"
}
}
}
}
},
"size": 0
}
Hope this helps!
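To make clear what that terms plus sum combination computes, here is the same logic in plain Python over a handful of in-memory records. This is only an illustration of the grouping and ordering, not a replacement for the aggregation; the function name and the sample records are invented.

```python
from collections import defaultdict


def top_words(records, topic_id, limit=1000):
    """Sum occurrences per word for one topic, highest total first."""
    totals = defaultdict(int)
    for record in records:
        if record["topic_id"] == topic_id:  # the term query
            totals[record["value"]] += record["occurrences"]  # the sum sub-aggregation
    # The order clause: sort buckets by the summed value, descending.
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]


# Invented sample records in the shape shown in the question.
records = [
    {"sentence_id": 1200, "topic_id": 2, "value": "ground", "occurrences": 20},
    {"sentence_id": 1201, "topic_id": 2, "value": "ground", "occurrences": 5},
    {"sentence_id": 1201, "topic_id": 2, "value": "water", "occurrences": 30},
    {"sentence_id": 1300, "topic_id": 3, "value": "ground", "occurrences": 99},
]
ranking = top_words(records, topic_id=2)
```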

Show all buckets from an aggregation within a single _type when one index contains multiple _types with the same field names

I created an index named "electronics". In it I created two _types, "mobiles" and "laptops", which share a common field name, "screensize".
Since I need to show facets for all the terms present in the field, I am using aggregations to generate the terms and their counts.
{
"aggs": {
"distinct_field": {
"terms": {
"field": "screensize",
"min_doc_count": 0,
"size": 0
}
}
}
}
In the response I am getting all the screensizes, from the mobiles _type as well as from laptops (since Lucene treats identical field names from different types as a single field). I only need the terms present in mobiles, even if their count is 0.
I thought about applying a filtered query for the mobiles _type before the aggregation, but the results were still the same.
{
"query": {
"filtered": {
"filter": {
"type": {
"value": "mobiles"
}
}
}
},
"aggs": {
"distinct_field": {
"terms": {
"field": "screensize",
"min_doc_count": 0,
"size": 0
}
}
}
}
Is there any way I could possibly get only the terms from a single _type for a particular field?
I suggest another approach: a terms aggregation with a script instead of a field, like this. The script returns the value of screensize only if the type of the document is mobiles, and null otherwise. This should work, try it out:
{
"aggs": {
"distinct_field": {
"terms": {
"script": "doc._type.value == 'mobiles' ? doc.screensize.value : null",
"min_doc_count": 0,
"size": 0
}
}
}
}
For this to work, you also need to make sure that scripting is enabled.

Getting hits inside aggregations in Elasticsearch

I have a date field in my data. I ran a date histogram aggregation on it, with the interval set to month. It now returns the number of documents per month interval.
Here is the query I used:
{
"aggs": {
"dateHistogram": {
"date_histogram": {
"field": "currentDate",
"interval": "day"
}
}
}
}
Below is the exact response I received.
{
"aggregations": {
"dateHistogram": {
"buckets": [{
"key_as_string": "2015-05-06",
"key": 1430870400000,
"doc_count": 10
}, {
"key_as_string": "2015-04-06",
"key": 1430870500000,
"doc_count": 14
}]
}
}
}
From the above response it is clear that there are 10 documents under the key "1430870400000" and 14 documents under the key "1430870500000". But apart from the document count, the individual documents are not shown. I want them included in the response so that I can read values from them. How do I achieve this in Elasticsearch?
The easiest way to do this is the top_hits aggregation; its usage is described in the Elasticsearch documentation.
The top_hits aggregation returns the matching documents inside each bucket of the aggregation you have built. It also has options for the offset to start from, the number of documents to return, and the sort order.
As I understand it, you want to fetch all the documents and use them for the aggregation, so you should use a match_all query together with the aggregation, as below:
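As a rough sketch of how that fits the question's histogram, the Python below builds a request body with a top_hits sub-aggregation under date_histogram and unpacks the documents per bucket. The function names, the sub-aggregation name `documents`, and the sample response are my own; the field name and the interval parameter are taken from the question (more recent Elasticsearch versions use calendar_interval or fixed_interval instead of interval).

```python
def histogram_with_hits(date_field, interval, hits_per_bucket):
    """Build a date histogram whose buckets also carry their top documents."""
    return {
        "size": 0,  # top-level hits are not needed
        "aggs": {
            "dateHistogram": {
                "date_histogram": {"field": date_field, "interval": interval},
                "aggs": {
                    "documents": {"top_hits": {"size": hits_per_bucket}}
                },
            }
        },
    }


def documents_per_bucket(response, histogram="dateHistogram", hits_agg="documents"):
    """Map each bucket's key_as_string to the _source of its top hits."""
    result = {}
    for bucket in response["aggregations"][histogram]["buckets"]:
        hits = bucket[hits_agg]["hits"]["hits"]
        result[bucket["key_as_string"]] = [hit["_source"] for hit in hits]
    return result


request_body = histogram_with_hits("currentDate", "day", 10)

# Hypothetical response with a single bucket holding one document.
sample_response = {
    "aggregations": {
        "dateHistogram": {
            "buckets": [
                {
                    "key_as_string": "2015-05-06",
                    "key": 1430870400000,
                    "doc_count": 1,
                    "documents": {
                        "hits": {"hits": [{"_source": {"currentDate": "2015-05-06"}}]}
                    },
                }
            ]
        }
    }
}
per_day = documents_per_bucket(sample_response)
```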
{
"query": {
"bool": {
"must": [
{
"match_all": {}
}
]
}
},
"aggs": {
"date_wise_logs_counts": {
"date_histogram": {
"field": "currentDate",
"interval": "day"
}
}
}
}
The query above returns the default 10 documents in the hits array; use size=BIGNUMBER to get more than 10 items (where BIGNUMBER is a number you believe is larger than your dataset). However, you should use scan and scroll instead of a very large size.
