Different set of results for "significant terms" in Elasticsearch using REST API or TransportClient

We use the new significant terms aggregation in Elasticsearch. Using the transport client I get fewer results than when I use the REST API, and I don't understand why. Using the node client is unfortunately not an option, since my service that uses ES is not in the same network. Why are the results different?
Here is the REST call:
POST /searchresults_sharded/article/_search
{
"query": {
"match": {
"titlebody": {
"query": "japanische hundenamen",
"operator": "and"
}
}
},
"aggregations": {
"searchresults": {
"significant_terms": {
"field": "titlebody",
"size": 100
}
}
}
}
And here is the Scala request-building code:
val builder = reqBuilder.searchReqBuilder
builder.setIndices(indexCoords.indexName)
builder.setTypes(indexCoords.typeName)
builder.setQuery(QueryBuilders.matchQuery(indexCoords.field, keywords.mkString(" ")).operator(MatchQueryBuilder.Operator.AND))
val sigTermAggKey: String = "significant-term"
val sigTermBuilder = new SignificantTermsBuilder(sigTermAggKey)
sigTermBuilder.field(indexCoords.field)
sigTermBuilder.size(size)
builder.addAggregation(sigTermBuilder)

I called toString on the builder and found the cause: the two requests used different size parameters. Both sizes (20 and 100) were larger than the number of significant-terms buckets actually returned (7 compared to 1), but the size parameter evidently affects how many buckets come back even when that number is far below it. A likely explanation is that size also influences the default shard_size, that is, how many candidate terms each shard contributes before the results are merged.
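One way to rule out this kind of mismatch is to state size explicitly in both requests, and optionally shard_size as well. The sketch below only builds the request body as a plain Python dict; the helper name build_sig_terms_body and its defaults are illustrative and not part of the original code.

```python
import json


def build_sig_terms_body(field, text, size=100, shard_size=None):
    """Build a search body with a significant_terms aggregation.

    size is the number of buckets requested; shard_size, if given,
    is the number of candidate terms requested from each shard.
    """
    agg = {"field": field, "size": size}
    if shard_size is not None:
        agg["shard_size"] = shard_size
    return {
        "query": {"match": {field: {"query": text, "operator": "and"}}},
        "aggregations": {"searchresults": {"significant_terms": agg}},
    }


body = build_sig_terms_body("titlebody", "japanische hundenamen", size=100)
print(json.dumps(body, indent=2))
```

Comparing this body with the toString output of the Java/Scala builder should show quickly whether the two clients really send the same size.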

Related

Daterange + top_hits aggregation (as subaggregation) with Elasticsearch Java API Client 7.17.x

I've been at this for a day and I don't quite understand how I do it! This is the query I want to "recreate" with the new Java API Client (using Spring Boot)
{
"aggs": {
"range": {
"date_range": {
"field": "timestamp",
"ranges": [
{ "to": "now-2d" }
]
}
}
,
"aggs": {
"top_hits": {
"_source": {
"includes": [ "Id", "timestamp" ]
}
}
}
}
}
I tried doing it with DateRangeAggregation.of but I can't seem to get the right results or type. Here's what I have
SearchResponse<MyDto> response = client.search(b -> b
.index("test-index")
.size(0)
.aggregations("range",a->a.dateRange(DateRangeAggregation.of(d->d
.field("timestamp").ranges(r->r.to(t->t.expr("now-2d")))))),
.aggregations("hits", a -> a
.topHits(h->h.source(SourceConfig.of(c->c.filter(f->f.includes(Arrays.asList("Id", "timestamp"))))))),
MyDto.class
);
I've also tried removing the subaggregation and query for now, but I don't seem to be on the right track to even get the number of doc_count from the bucket. I kind of don't get how to work with the dateRange() here.
Edit: I played around a bit and was able to at least get the number of doc_count, I'm not very sure if this is a good way to do it though?
Aggregation agg = Aggregation.of(a -> a
.dateRange(d->d.field("timestamp").ranges(r->r.to(FieldDateMath.of(v->v.expr("now-2d"))))));
SearchResponse<MyDto> response = client.search(b -> b
.index("test-index")
.size(0)
.aggregations("range", agg),
MyDto.class
);
return response.aggregations().get("range").dateRange().buckets().array().get(0).docCount();
I also fixed the query above, it had an unnecessary extra query that broke the result.
My thought process was wrong. I wanted the documents that fell within this time range, and I mistakenly assumed top_hits would return them, but that's not how it works! I wrote a separate range query that actually returns the documents I needed instead.
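For illustration, a separate range query of that kind could look roughly like the body below, written as a Python dict. The helper name and the max_hits default are assumptions; "lt" is used because the "to" bound of a date_range bucket is exclusive.

```python
import json


def build_older_than_query(field="timestamp", cutoff="now-2d", max_hits=10):
    """Return a search body matching documents older than the cutoff.

    Mirrors the date_range bucket {"to": "now-2d"}, but as a query,
    so the matching documents themselves come back as hits.
    """
    return {
        "size": max_hits,
        "_source": ["Id", "timestamp"],
        "query": {"range": {field: {"lt": cutoff}}},
    }


print(json.dumps(build_older_than_query(), indent=2))
```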

Elasticsearch - get (unfiltered) aggregates for a (filtered) subset

I have an elasticsearch index containing "hit" documents (with fields like ip/timestamp/uri etc) which are populated from my nginx access logs.
I'm looking for a method of getting the total number of hits / ip - but for a subset of IPs, namely the ones that did a request today.
I know I can have a filtered aggregation by doing:
/search?size=0
{
'query': { 'bool': { 'must': [
{'range': { 'timestamp': { 'gte': $today}}},
{'query_string': {'query': 'status:200 OR status:404'}},
]}},
'aggregations': {'c': {'terms': {'field': 'ip', 'size': 99999}}}
}
but this will sum only the hits that were done today, I want the total number of hits in the index but only from IPs that have hits today. Is this possible?
-edit-
I've tried the global option but while
'aggregations': {'c': {'global': {}, 'aggs': {'c2': {'terms': {'field': 'remote_user', 'size': 99999}}}}}
returns counts from all IPs; it ignores my filter on timestamp (eg. it includes IPs that did hits a couple of days ago)
There is a way to achieve what you want in a single query, but since it involves scripting, performance might suffer depending on the volume of data you run this query on.
The idea is to leverage the scripted_metric aggregation in order to build your own aggregation logic over the whole document set.
What we do below is pretty simple:
- we don't give any query, so we consider the full document set
- Map phase: we build a map of all IPs, and for each IP
  - we count the total number of hits
  - we flag it if it had hits today AND with the given status (same as what you do in your query)
- Reduce phase: we return the total hits count for each IP that was flagged as having hits today
Here is what the query looks like:
POST my-index/_search
{
"size": 0,
"aggs": {
"all_time_hits": {
"scripted_metric": {
"init_script": "state.ips = [:]",
"map_script": """
// initialize total hits count for each IP and increment
def ip = doc['ip.keyword'].value;
if (state.ips[ip] == null) {
state.ips[ip] = [
'total_hits': 0,
'hits_today': false
]
}
state.ips[ip].total_hits++;
// flag IP if:
// 1. it has hits today
// 2. the hit had one of the given statuses
def today = Instant.ofEpochMilli(new Date().getTime()).truncatedTo(ChronoUnit.DAYS);
def hitDate = doc['timestamp'].value.toInstant().truncatedTo(ChronoUnit.DAYS);
def hitToday = today.equals(hitDate);
def statusOk = params.statuses.indexOf((int) doc['status'].value) >= 0;
state.ips[ip].hits_today = state.ips[ip].hits_today || (hitToday && statusOk);
""",
"combine_script": "return state.ips;",
"reduce_script": """
def ips = [:];
for (state in states) {
for (ip in state.keySet()) {
// only consider IPs that had hits today
if (state[ip].hits_today) {
if (ips[ip] == null) {
ips[ip] = 0;
}
ips[ip] += state[ip].total_hits;
}
}
}
return ips;
""",
"params": {
"statuses": [200, 404]
}
}
}
}
}
And here is what the response looks like:
"aggregations" : {
"all_time_hits" : {
"value" : {
"123.123.123.125" : 1,
"123.123.123.123" : 4
}
}
}
I think that pretty much does what you expect.
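To make the map and reduce logic easier to follow, here is a plain-Python sketch that mimics the two phases on in-memory lists standing in for shards. The sample documents, IP addresses and function names are made up for illustration; this is not how Elasticsearch executes the script, only the same logic in miniature.

```python
from datetime import date, datetime


def map_phase(docs, today, statuses):
    """Per-shard state: total hits per IP plus a 'hits_today' flag."""
    ips = {}
    for doc in docs:
        entry = ips.setdefault(doc["ip"], {"total_hits": 0, "hits_today": False})
        entry["total_hits"] += 1
        hit_today = doc["timestamp"].date() == today
        status_ok = doc["status"] in statuses
        entry["hits_today"] = entry["hits_today"] or (hit_today and status_ok)
    return ips


def reduce_phase(states):
    """Sum total hits across shard states, keeping only flagged IPs."""
    totals = {}
    for state in states:
        for ip, entry in state.items():
            if entry["hits_today"]:
                totals[ip] = totals.get(ip, 0) + entry["total_hits"]
    return totals


today = date(2024, 1, 10)
shard_1 = [
    {"ip": "10.0.0.1", "timestamp": datetime(2024, 1, 10, 8), "status": 200},
    {"ip": "10.0.0.1", "timestamp": datetime(2024, 1, 1, 8), "status": 200},
    {"ip": "10.0.0.2", "timestamp": datetime(2024, 1, 1, 9), "status": 200},
]
shard_2 = [
    {"ip": "10.0.0.1", "timestamp": datetime(2024, 1, 10, 9), "status": 404},
    {"ip": "10.0.0.3", "timestamp": datetime(2024, 1, 10, 9), "status": 500},
]
states = [map_phase(s, today, {200, 404}) for s in (shard_1, shard_2)]
result = reduce_phase(states)
print(result)
```

Note that, like the script, the reduce step checks the flag per shard state: if an IP has a hit today on one shard but only older hits on another, the older shard's hits are not added. If that matters for your data, the reduce step would need to first collect the flagged IPs across all states and then sum.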
The other option (more performant because no script) requires you to make two queries. First, a query with the date range and status check with a terms aggregation to retrieve all IPs that have hits today (like you do now), and then a second query where you filter on those IPs (using a terms query) over the whole index (no date range or status check) and get hits count for each of them using a terms aggregation.
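A rough sketch of the two request bodies, as Python dicts, might look like this. The helper names, the aggregation names and the max_ips default are assumptions; the field names ip, status and timestamp follow the question (depending on your mapping, ip.keyword may be needed instead).

```python
def build_ips_today_query(today, statuses, max_ips=10000):
    """First query: which IPs had a hit today with one of the statuses?"""
    return {
        "size": 0,
        "query": {"bool": {"filter": [
            {"range": {"timestamp": {"gte": today}}},
            {"terms": {"status": statuses}},
        ]}},
        "aggs": {"ips_today": {"terms": {"field": "ip", "size": max_ips}}},
    }


def ips_from_response(response):
    """Extract the bucket keys (IPs) from the first query's response."""
    return [b["key"] for b in response["aggregations"]["ips_today"]["buckets"]]


def build_all_time_query(ips):
    """Second query: all-time hit counts, restricted to the given IPs."""
    return {
        "size": 0,
        "query": {"terms": {"ip": ips}},
        "aggs": {"all_time_hits": {"terms": {"field": "ip", "size": max(len(ips), 1)}}},
    }
```

The first body is sent as is; the IPs extracted from its response feed the second body. Keep in mind that a terms query accepts only a limited number of terms, so a very large IP list may have to be split into batches.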
In the example you shared, you have a query, and your documents are filtered according to it. But you want your aggregation to take all documents into account regardless of the query.
This is why the global aggregation exists. It defines a single bucket containing all documents in the search execution context.
This context is defined by the indices and the document types you're searching on, but is not influenced by the search query itself.
Sample query example:
{
"query": {
"match": { "type": "t-shirt" }
},
"aggs": {
"all_products": {
"global": {},
"aggs": {
"avg_price": { "avg": { "field": "price" } }
}
}
}
}

Elasticsearch From and Size on aggregation for pagination

First of all, I want to say that what I'm trying to achieve works very well on Solr 5.3.1 but not on Elasticsearch 6.2 running as a service on AWS.
My actual query is very large and complex. It works fine in Kibana, but as soon as I go beyond from = 100 and size = 50 the Kibana console shows an error.
What I know:
For a normal search, the maximum from is 10000, and
for an aggregated search, the maximum from is 100.
If I exceed that limit, I either have to raise the maximum, which is not possible because I'm using ES as a service on AWS, or I have to use the Scroll API with a scroll ID to get paginated data.
The Scroll API works fine, as I've used it in another part of my project, but when I try the same scroll with an aggregation it doesn't work as expected.
With the Scroll API, the first search returns the aggregated data, but subsequent calls with the scroll ID return only the hits, not the aggregation results.
Query on Kibana
GET /properties/_search
{
"size": 10,
"query": {
"bool": {
"must": [
{
"match": {
"published": true
}
},
{
"match": {
"country": "South Africa"
}
}
]
}
},
"aggs": {
"aggs_by_feed": {
"terms": {
"field": "feed",
"order": {
"_key": "desc"
}
},
"aggs": {
"tops": {
"top_hits": {
"from": 100,
"size": 50,
"_source": [
"id",
"feed_provider_id"
]
}
}
}
}
},
"sort": [
{
"instant_book": {
"order": "desc"
}
}
]
}
With search in Python: the problem I'm facing is that the first search returns the aggregated data along with the hits, but subsequent calls with the scroll ID return only the hits, not the aggregated data.
if index_name is not None and doc_type is not None and body is not None:
    es = init_es()
    page = es.search(index_name, doc_type, scroll='30s', size=10, body=body)
    sid = page['_scroll_id']
    scroll_size = page['hits']['total']
    # Start scrolling
    while (scroll_size > 0):
        print("Scrolling...")
        page = es.scroll(scroll_id=sid, scroll='30s')
        # Update the scroll ID
        sid = page['_scroll_id']
        print("scroll id: " + sid)
        # Get the number of results that we returned in the last scroll
        scroll_size = len(page['hits']['hits'])
        print("scroll size: " + str(scroll_size))
        print("scrolled data :")
        print(page['aggregations'])
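A note on the behaviour described above: the scroll documentation states that when a scroll request contains aggregations, only the initial search response carries the aggregation results. Under that assumption, a minimal sketch is to keep the aggregations from the first page and scroll only for hits. The function name scroll_with_aggs is hypothetical; es is expected to be an elasticsearch-py client (or anything with compatible search and scroll methods), and keyword arguments may differ between client versions.

```python
def scroll_with_aggs(es, index, body, page_size=10, keep_alive="30s"):
    """Return (aggregations, all_hits) for a scrolled search.

    The aggregations are taken from the first response only, since
    later scroll pages are not expected to contain them.
    """
    page = es.search(index=index, body=body, scroll=keep_alive, size=page_size)
    aggregations = page.get("aggregations")
    hits = []
    # Keep scrolling until a page comes back without hits.
    while page["hits"]["hits"]:
        hits.extend(page["hits"]["hits"])
        page = es.scroll(scroll_id=page["_scroll_id"], scroll=keep_alive)
    return aggregations, hits
```

This does not paginate the buckets themselves; it only avoids reading page['aggregations'] on later pages, where the key would be missing.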
With elasticsearch-dsl in Python: with this approach I'm struggling to select the _source fields, such as id and feed_provider_id, in the nested aggregation, i.e. tops -> top_hits.
es = init_es()
s = Search(using=es, index=index_name, doc_type=doc_type)
s.aggs.bucket('aggs_by_feed', 'terms', field='feed').metric('top', 'top_hits', field='id')
response = s.execute()
print('Hit........')
for hit in response:
    print(hit.meta.score, hit.feed)
print(response.aggregations.aggs_by_feed)
print('AGG........')
for tag in response.aggregations.aggs_by_feed:
    print(tag)
So my questions are:
Is it not possible to get data using from and size for the aggregated query above once from exceeds 100?
If it is possible, please give me a hint on how to do it, either with plain Elasticsearch or with elasticsearch-dsl in Python, as I'm not very familiar with elasticsearch-dsl or with Elasticsearch buckets, metrics, etc.
Some answers on SO suggest using partitions, but I don't know how to apply that to my scenario (see "How to control the elasticsearch aggregation results with From / Size?").
Others say that this feature is not currently supported by ES (there is an open feature request). If it's not possible, what else can be done in place of grouping in Solr?

Aggregation to return all values, not do a group by

Can an aggregation return all values? Is there any way to do this with scripts?
{
"size": 0,
"_source":["docDescription","datasource"],
"query": {
"match_all":{}
},
"aggs":{
"projectNameMatchCount": {
"filter" : { "match": { "docDescription": ".ppt" } },
"aggs":{
"names":{
"terms":{"field":"_id"}
}
}
},
"datasourceSourceMatchCount": {
"filter" : { "match": { "datasource": "NGA" } }
}
}
}
In the aggregation projectNameMatchCount I apply a filter and then call another aggregation to return the values, but terms does a group by. I don't want a group by; all I want is to return the field values.
Aggregations are for grouping data sets together to derive a certain metric. If you want individual elements to be returned, you should run direct queries/filters instead. Aggregations are post-processing steps that run on the data set narrowed down by your query, and they are comparatively more expensive than queries/filters. So they should be avoided unless you need aggregated metrics.
Having said that, what I understood from your query is that you are using two aggregations: you want one to return some document IDs and the other to just return a count based on a different filter. You can do so by using a top_hits aggregation within the filter aggregation in projectNameMatchCount. For more details: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-top-hits-aggregation.html
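As a rough sketch, the request body with top_hits nested inside the filter aggregation could be built like this in Python. The sub-aggregation name docs, the helper name and the max_docs default are illustrative assumptions; the filters and field names come from the question.

```python
import json


def build_filter_with_top_hits(max_docs=10):
    """Filter aggregation that returns documents via top_hits instead of terms."""
    return {
        "size": 0,
        "aggs": {
            "projectNameMatchCount": {
                "filter": {"match": {"docDescription": ".ppt"}},
                "aggs": {
                    "docs": {
                        "top_hits": {
                            "size": max_docs,
                            "_source": ["docDescription", "datasource"],
                        }
                    }
                },
            },
            "datasourceSourceMatchCount": {
                "filter": {"match": {"datasource": "NGA"}}
            },
        },
    }


print(json.dumps(build_filter_with_top_hits(), indent=2))
```

top_hits is meant for a handful of documents per bucket and its size is capped by an index setting, so it is not a way to return every matching document.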
But still, I believe you will benefit more by simply making two separate queries in terms of total query time and the resources consumed at ElasticSearch side, one with a query to return the IDs and the other with aggregation to return the count of docs.

Limit and Offset in Term Aggregation ElasticSearch

There is a way to get the top n terms. For example:
{
"aggs": {
"apiSalesRepUser": {
"terms": {
"field": "userName",
"size": 5
}
}
}
}
Is there any way to set the offset for the terms result?
If you mean something like ignore first m results and return the next n results then no; it is not possible. A workaround to that would be to set size to m + n and do client side processing to ignore the first m results.
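A minimal sketch of that workaround in Python, operating on the buckets list of a terms aggregation response. The function names and the sample buckets are made up for illustration.

```python
def terms_size_for(offset, limit):
    """Size to request from the terms aggregation: m + n."""
    return offset + limit


def page_buckets(buckets, offset, limit):
    """Drop the first `offset` buckets and keep the next `limit`."""
    return buckets[offset:offset + limit]


# Pretend response buckets for a terms aggregation requested with size m + n.
buckets = [{"key": "user%d" % i, "doc_count": 100 - i} for i in range(terms_size_for(3, 2))]
page = page_buckets(buckets, offset=3, limit=2)
print([b["key"] for b in page])
```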
A little late, but (at least) since Elastic 5.2.0 you can use partitioning in the terms aggregation to paginate results.
https://www.elastic.co/guide/en/elasticsearch/reference/5.2/search-aggregations-bucket-terms-aggregation.html#_filtering_values_with_partitions
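For reference, a partitioned terms aggregation puts the partition settings under include. The sketch below builds one request body per partition as a Python dict; the helper name and the numbers used are illustrative assumptions.

```python
def build_partitioned_terms(field, partition, num_partitions, size):
    """Terms aggregation restricted to one partition of the term space."""
    if not 0 <= partition < num_partitions:
        raise ValueError("partition must be in [0, num_partitions)")
    return {
        "size": 0,
        "aggs": {
            "apiSalesRepUser": {
                "terms": {
                    "field": field,
                    "include": {
                        "partition": partition,
                        "num_partitions": num_partitions,
                    },
                    "size": size,
                }
            }
        },
    }


# One body per partition; each would be sent as a separate request.
bodies = [build_partitioned_terms("userName", p, 20, 1000) for p in range(20)]
print(bodies[0])
```

Partitions split the terms into groups rather than by rank, so this walks through all terms in chunks but does not give "the next n terms by doc_count" the way an offset would.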
Maybe this helps a bit. Note that the terms size is set to a bigger value, and that bucket_sort has to be nested inside the terms aggregation:
"aggregations": {
"apiSalesRepUser": {
"terms": {
"field": "userName",
"size": 9999
},
"aggregations": {
"limitBucket": {
"bucket_sort": {
"sort": [],
"from": 10,
"size": 20,
"gap_policy": "SKIP"
}
}
}
}
}
I am not sure what value to put in the terms size; I would suggest a reasonable value. It limits the initial aggregation, and the limitBucket aggregation then limits the result of the terms aggregation again. This will probably still hold in memory everything that the terms aggregation was limited to. So it depends on your scenario whether it is reasonable not to fetch all results (i.e. if you have tens of thousands), for example in a Google-like search where you don't need to jump to page 1000.
Compared to the alternative of paging on the client side, this might save you some data transfer from ES, but as I said, weigh this carefully, as it loads a lot of data into ES memory and you might run into memory issues in Elasticsearch.
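If you go this route, the body can be generated per page. The sketch below nests bucket_sort inside the terms aggregation (it is a parent pipeline aggregation) and, as a design choice of this sketch rather than anything required, sets the terms size to from + size so that no more buckets are collected than the requested page needs. The helper name is hypothetical.

```python
def build_bucket_sort_page(field, offset, limit):
    """Terms aggregation truncated to one page of buckets via bucket_sort."""
    return {
        "size": 0,
        "aggs": {
            "apiSalesRepUser": {
                # Collect just enough buckets to cover the requested page.
                "terms": {"field": field, "size": offset + limit},
                "aggs": {
                    "limitBucket": {
                        "bucket_sort": {"from": offset, "size": limit}
                    }
                },
            }
        },
    }


print(build_bucket_sort_page("userName", offset=10, limit=20))
```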
