Term aggregation consider only the prefix to aggregate - elasticsearch

In my Elasticsearch documents I have users and a representation of each user's place in the organization, for instance:
The CEO is position 1
The ones directly under the CEO will be 1/1, 1/2, 1/3, and so on
The ones under 1/1 will be 1/1/1, 1/1/2, 1/1/3, etc.
I have an aggregation in which I want to aggregate by VP, so I want everybody under 1/1, 1/2, 1/3.
To do that I created a query like this one:
"aggs": {
  "information": {
    "terms": {
      "field": "position",
      "script": "_value.replaceAll('(1/1/[0/]*[1-9]).+', '$1')"
    }
  }
}
This replaces each position with the prefix captured by the regex group, so everyone under the same VP ends up with the same value and can be aggregated together. However, it performs poorly.
I was thinking about using something like this:
"aggs": {
  "information": {
    "terms": {
      "field": "position",
      "prefix": "1/1/.*"
    }
  }
}
So I would group by everyone that starts with 1/1 (1/1/1/1, 1/1/1/2, 1/1/1/3 would be one group, 1/1/2/1, 1/1/2/2, 1/1/2/3 would be a second group and so on).
Is it possible?

If you know beforehand at which depth you want to run this aggregation, you could simply store these levels in separate fields:
{
"name": "Jack",
"own_level": 4,
"level_1": "1",
"level_2": "3",
"level_3": "2",
"level_4": null
}
But this would require many nested terms aggregations to reproduce the hierarchy. This version would make one such aggregation sufficient:
{
"name": "Jack",
"own_level": 4,
"level_1": "1",
"level_2": "1/3",
"level_3": "1/3/2",
"level_4": null
}
It also simplifies filtering: to focus on people under, for example, 1/1, put a filter on field level_2 and a terms aggregation on field level_3.
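As a rough illustration of how the cumulative level fields could be derived at index time and then queried, here is a minimal Python sketch. It assumes the position is stored as a slash-separated path such as 1/3/2/5, that only ancestors are written to level_N fields (as in the Jack example), and that those fields are mapped as keyword. The names level_fields, under_boss_request and max_depth are hypothetical, not part of any Elasticsearch API.

```python
def level_fields(position, max_depth=4):
    """Derive own_level and cumulative level_N prefix fields from a path.

    Assumption: `position` is a slash-separated path such as "1/3/2/5".
    Only ancestors get a level_N value; the own level and deeper stay None.
    """
    parts = position.split("/")
    doc = {"own_level": len(parts)}
    for depth in range(1, max_depth + 1):
        if depth < len(parts):
            # Cumulative prefix, e.g. depth 2 of "1/3/2/5" is "1/3"
            doc["level_%d" % depth] = "/".join(parts[:depth])
        else:
            doc["level_%d" % depth] = None
    return doc


def under_boss_request(boss_position):
    """Build a search body that groups everyone under one boss.

    Filters on the boss's own level field and runs a terms aggregation
    on the next level down (assumed to be a keyword field).
    """
    depth = len(boss_position.split("/"))
    return {
        "size": 0,
        "query": {
            "bool": {
                "filter": [{"term": {"level_%d" % depth: boss_position}}]
            }
        },
        "aggs": {
            "information": {"terms": {"field": "level_%d" % (depth + 1)}}
        },
    }


jack = level_fields("1/3/2/5")
request = under_boss_request("1/1")
```

Here under_boss_request("1/1") filters on level_2 and aggregates on level_3, so 1/1/1/x, 1/1/2/x and so on each end up in their own bucket.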
If you don't know the maximum level of the hierarchy you can use nested documents like this, but then queries and aggregations get a bit more complex:
{
"name": "Jack",
"own_level": 4,
"bosses": [
{
"level": 1,
"id": "1"
},
{
"level": 2,
"id": "1/3"
},
{
"level": 3,
"id": "1/3/2"
}
]
}
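To give an idea of the extra complexity, the following Python sketch builds an aggregation over such nested documents. It assumes that bosses is mapped with type nested and that bosses.id is a keyword field; the function name and the aggregation names (at_level, by_boss) are made up for illustration.

```python
def nested_boss_aggregation(level):
    """Request body that groups users by their boss at the given level.

    Assumptions: `bosses` is a nested field, `bosses.level` is numeric
    and `bosses.id` is a keyword field.
    """
    return {
        "size": 0,
        "aggs": {
            "bosses": {
                # Step into the nested documents
                "nested": {"path": "bosses"},
                "aggs": {
                    "at_level": {
                        # Keep only the boss entries of the requested level
                        "filter": {"term": {"bosses.level": level}},
                        "aggs": {
                            "by_boss": {"terms": {"field": "bosses.id"}}
                        },
                    }
                },
            }
        },
    }


request = nested_boss_aggregation(3)
```

Since each user has at most one boss entry per level, the bucket counts under by_boss should correspond to the number of users below each boss.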

Related

Elasticsearch Rank based on rarity of a field value

I'd like to know how I can rank items lower when they have field values that appear frequently among the results.
Say, we have a similar result set:
"name": "Red T-Shirt"
"store": "Zara"
"name": "Yellow T-Shirt"
"store": "Zara"
"name": "Red T-Shirt"
"store": "Bershka"
"name": "Green T-Shirt"
"store": "Benetton"
I'd like to rank the documents in such a manner that the documents containing frequently found fields,
"store" in this case, are deboosted to appear lower in the results.
This is to achieve a bit of variety, so that the search doesn't yield top results from the same store.
In the example above, if I search for "T-Shirt", I want to see one Zara T-Shirt at the top and the rest
of Zara T-Shirts should be appearing lower, after all other unique stores.
So far I have looked into sorting by aggregation buckets and into script sorting, but without success.
Is it possible to achieve this inside of the search engine?
Many thanks in advance!
This is possible with a combination of the diversified sampler aggregation and the top hits aggregation, as I learned from the Elastic forum. I don't know what the performance implications are on a high-load production system. Here is a code example; use at your own risk:
{
"query": {}, // whatever query
"size": 0, // since we don't use hits
"aggs": {
"my_unbiased_sample": {
"diversified_sampler": {
"shard_size": 100,
"field": "store"
},
"aggs": {
"keywords": {
"top_hits": {
"_source": {
"includes": [ "name", "store" ]
},
"size": 100
}
}
}
}
}
}

ElasticSearch. How can I get one document without counting all documents by filter?

I want to get any document by a filter if it exists and I do not want ElasticSearch to count how many documents fit this filter.
Example, there are docs:
{"name": "dima", "age": 15},
{"name": "amid", "age": 15}
I want one document (size=1) where age is 15, and I don't want ElasticSearch to waste time counting all the matching documents.
I do not need this:
"hits": {
"total": {
"value": 2,
...
},
You can add a size field to tell Elastic how many docs to return. If you'd only like one, just set size to 1. https://www.elastic.co/guide/en/elasticsearch/reference/current/search-search.html
For example:
GET my-index/_search
{
"size": 1,
"query": {
"bool": {
"filter": [
{
"term": {
"age": "15"
}
}
]
}
}
}
Alternatively you can just specify the size in the query string params
my-index/_search?size=1
I understand that this is not exactly what you want, because you'd rather Elastic just stop looking as soon as it finds the first one, but I don't think that is a possibility. Check out the conversation on Elastic here: https://discuss.elastic.co/t/how-to-stop-searching-on-first-match/15507/3
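Depending on the Elasticsearch version, two search request options may get closer to the desired behaviour: track_total_hits (available from 7.0, which matches the hits.total.value response format shown in the question) and terminate_after, which caps the number of documents collected per shard. The Python sketch below only builds the request body; the helper name first_match_request is hypothetical, and whether this actually saves time on a given cluster is something to measure rather than assume.

```python
import json


def first_match_request(field, value):
    """Build a search body that asks for one hit and no total count."""
    return {
        "size": 1,
        # Do not track the total number of matching documents
        "track_total_hits": False,
        # Stop collecting after one document per shard
        "terminate_after": 1,
        "query": {"bool": {"filter": [{"term": {field: value}}]}},
    }


body = first_match_request("age", 15)
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of GET my-index/_search. Note that terminate_after applies per shard, so an index with several shards may still collect one document on each of them before size trims the response to a single hit.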

Elasticsearch: Product-variant-price modelling and query problem

I want to use Elasticsearch to improve performance on product search (duh) in an e-commerce solution. We have a data model where a product can have multiple variants and each variant can have one or more prices (sometimes quite a substantial number of prices).
At query time, the user chooses whether to return products or variants, and only one price should be returned: the lowest valid price. Each price has a number of fields, such as valid from/to dates and valid customer groups.
My first approach was to denormalize product/variants and have prices as nested fields, but this was quite slow and I had a few problems sorting (I think on price, but the exact details elude me right now).
My second approach was to denormalize totally, so that every product/variant/price combination is represented as a document. This approach is much faster (obviously), and I can aggregate on productId or variantId and get the lowest price, but the problem is that I cannot sort the aggregates on non-numeric or non-aggregate fields.
Denormalized documents (productId, variantId are keyword fields, price is numeric, validFrom/-To are date and the rest is text):
[
{
"productId": "111-222-333",
"variantId": "aaa-bbb-ccc",
"product_title": "Mega-product",
"product_description": "This awesome piece of magic will change your life",
"variant_title": "Green mega-product",
"variant_description": "Behold the awesomeness of the green magic mega-product",
"color": [
"blue",
"green"
],
"brand": "DaBrand",
"validFrom": "2019-06-01T00:00:00Z",
"validTo": null,
"price": 399
},
{
"productId": "111-222-333",
"variantId": "aaa-bbb-ddd",
"product_title": "Mega-product",
"product_description": "This awesome piece of magic will change your life",
"variant_title": "Blue mega-product",
"variant_description": "Behold the awesomeness of the blue magic mega-product",
"color": [
"blue",
"green"
],
"brand": "DaBrand",
"validFrom": "2019-06-01T00:00:00Z",
"validTo": null,
"price": 499
},
{
"productId": "111-222-333",
"variantId": "aaa-bbb-ddd",
"product_title": "Mega-product",
"product_description": "This awesome piece of magic will change your life",
"variant_title": "Blue mega-product",
"variant_description": "Behold the awesomeness of the blue magic mega-product",
"color": [
"blue",
"green"
],
"brand": "DaBrand",
"validFrom": "2019-06-05T00:00:00Z",
"validTo": "2019-06-10T00:00:00Z",
"price": 399
}
]
An example of a working query where I sort on the aggregated price.
{
"size": 1,
"sort": {
"product_name_text_en.keyword": "asc"
},
"query": {
// All the query and filtering
},
"aggs": {
"by_product_id": {
"terms": {
"field": "product_id_string",
"order": {
"min_price": "desc"
}
},
"aggs": {
"min_price": {
"min": {
"field": "price_decimal"
}
}
}
}
}
}
However, using this approach I cannot find a way to sort on document fields. It is possible (I think) on numeric, boolean and date fields using bucket_sort, but I need to be able to sort on, for example, brand or title field (which are text). If it would've been possible to order on a top_hits aggregation I would be home free, but that's unfortunately not possible as I understand from the docs (I've also tried it just to make sure).
Can anyone guide me to a better solution? I don't mind if I have to do the query in two steps, but to make that work for sorting I likely need to have a few different "document types", like Product, Variant, ProductPrice and VariantPrice, to use depending on the requested sort order. I'm not that far gone, so remodelling is definitely on the table. I've considered using join fields, but I'm not sure that would be performant.
Since the number of products and variants (and prices) can be significant (a million products is definitely possible), I think I will have problems getting IDs from a query (for example, filtering on brand and sorting on title) and then sending them into a get-best-price query.
I figured this out by accident when I was reading the docs for another case. It all became very simple when I found out about Field collapsing. I feel like I should've known about this...
The index has the same model as in my initial question, but the query became much simpler:
{
"size": 10,
"query": {
// filter/match stuff, including filtering valid prices.
},
"collapse": {
"field": "productId",
"inner_hits": {
"name": "least_price",
"collapse": {
"field": "price"
},
"size": 1,
"sort": [
{
"price": "asc"
}
]
}
},
"sort": [
{
"brand.keyword": "asc"
}
]
}
And to return variants instead of products, I just collapse on variantId.
The collapsing is based on productId or variantId, and the least_price inner_hits returns the lowest-priced document (sorted ascending by price, picking the first) among the documents matching my criteria. Works like a charm.
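To make the product/variant switch concrete, here is a small Python sketch that builds the collapse request for either grouping. The function name lowest_price_request and its parameters are hypothetical; the match_all query is only a placeholder for the real filters (including the valid-price filtering), and the field names are the ones from the sample documents above.

```python
def lowest_price_request(group_field, sort_field, size=10):
    """Build a field-collapsing search body.

    group_field: "productId" or "variantId" (keyword fields).
    sort_field: a sortable field, for example "brand.keyword".
    """
    return {
        "size": size,
        # Placeholder: replace with the real filter/match clauses
        "query": {"match_all": {}},
        "collapse": {
            "field": group_field,
            "inner_hits": {
                "name": "least_price",
                "collapse": {"field": "price"},
                "size": 1,
                # Cheapest document of each group comes first
                "sort": [{"price": "asc"}],
            },
        },
        "sort": [{sort_field: "asc"}],
    }


by_product = lowest_price_request("productId", "brand.keyword")
by_variant = lowest_price_request("variantId", "brand.keyword")
```

Only the collapse field differs between the two bodies, which is what lets the user choose products or variants at query time without a second index.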

Is there a way to do group by aggregations and get all the documents that belong to particular group aggregate?

Is there a way to do group-by aggregations and also get all the documents that belong to a particular group?
This is not like a plain group-by aggregation, where you only get aggregates/metrics for each group: I also want all the records that lead to a particular group's aggregate, in one query. Is that possible in ES today?
For Example:
Input Dataset
{"name": "foo", "amount": 5, "city":"san francisco", "state": "CA"}
{"name": "foo", "amount": 10, "city":"Los angeles", "state": "CA"}
{"name": "bar", "amount": 20, "city":"Austin", "state": "TX"}
Now say I want to group by name and state, and get the sum of "amount" and the count for each group, plus the records themselves that lead to the aggregate results. The expected output would look like this:
Expected Output:
[
{group: {"name": "foo", "state": "CA"}, "amount": 15, "count": 2, "docs": [{"name": "foo", "amount": 5, "city":"san francisco", "state": "CA"}, {"name": "foo", "amount": 10, "city":"Los angeles", "state": "CA"}]},
{group: {"name": "bar", "state": "TX"}, "amount": 20, "count": 1, "docs": [{"name": "bar", "amount": 20, "city":"Austin", "state": "TX"}]}
]
ES 5.0 is fine.
You can use a combination of sub aggregations to get all your group by metrics, but it is a bad idea to try to get the hits returned as part of the aggregation. For N documents you are grouping over, you are essentially asking Elasticsearch to return every single document which defeats the purpose of aggregating in the first place.
Each field you are "grouping" on (in ES parlance, term aggregating) needs to be its own aggregation but you can nest them infinitely and programmatically serialize and deserialize the results according to the number of groupings you define. Make sure your term fields are "keyword" types!
This query will give you all the metrics you want; you just need to flatten the result app-side:
{
"aggs" : {
"by_name" : {
"terms" : { "field" : "name" },
"aggs" : {
"by_state" : {
"terms" : { "field" : "state" },
"aggs" : {
"total_amount" : { "sum" : { "field" : "amount" } }
}
}
}
}
}
}
If you really need those documents, can you use term filters to dynamically load them? Alternatively, if you really need to hack it and you understand the distribution of your data, you can use the top_hits sub aggregation to return the documents. Be aware that each additional sub aggregation, especially top hits, will impact performance.
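The app-side flattening of the nested by_name / by_state result could look roughly like the Python sketch below. The function name flatten_groups is hypothetical, and sample_response is a hand-written stand-in shaped like a terms/sum aggregation response for the three example documents, not output captured from a real cluster.

```python
def flatten_groups(response):
    """Turn nested by_name -> by_state buckets into one row per group."""
    rows = []
    for name_bucket in response["aggregations"]["by_name"]["buckets"]:
        for state_bucket in name_bucket["by_state"]["buckets"]:
            rows.append({
                "group": {
                    "name": name_bucket["key"],
                    "state": state_bucket["key"],
                },
                "amount": state_bucket["total_amount"]["value"],
                "count": state_bucket["doc_count"],
            })
    return rows


# Hand-written sample in the shape of an aggregation response
sample_response = {
    "aggregations": {
        "by_name": {
            "buckets": [
                {
                    "key": "foo",
                    "doc_count": 2,
                    "by_state": {
                        "buckets": [
                            {
                                "key": "CA",
                                "doc_count": 2,
                                "total_amount": {"value": 15.0},
                            }
                        ]
                    },
                },
                {
                    "key": "bar",
                    "doc_count": 1,
                    "by_state": {
                        "buckets": [
                            {
                                "key": "TX",
                                "doc_count": 1,
                                "total_amount": {"value": 20.0},
                            }
                        ]
                    },
                },
            ]
        }
    }
}

rows = flatten_groups(sample_response)
```

Each row carries the group key, the summed amount and the count, which matches the expected output in the question apart from the docs list. Those documents would still have to come from a follow-up term-filter query or a top_hits sub aggregation, as described above.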

Elasticsearch - how to do field collapsing and get Distinct results? (actual records, not just counters)

In relational db our data looks like this:
Company -> Department -> Office
Elasticsearch version of the same data (flattened):
{
  "officeID": 123,
  "officeName": "office 1",
  "state": "CA",
  "department": {
    "departmentID": 456,
    "departmentName": "Department 1",
    "company": {
      "companyID": 789,
      "companyName": "Company 1"
    }
  }
},
{
  "officeID": 124,
  "officeName": "office 2",
  "state": "CA",
  "department": {
    "departmentID": 456,
    "departmentName": "Department 1",
    "company": {
      "companyID": 789,
      "companyName": "Company 1"
    }
  }
}
We need to find department (or company) by providing office information (such as state).
For example, since all I need is the department info, I can specify it like this (we are using Nest):
searchDescriptor = searchDescriptor.Source(x => x.Include("department"));
and get all departments with qualifying offices.
The problem is - I am getting multiple "department" records with the same id (one for each office).
We are using paging and sorting.
Would it be possible to get paged and sorted Distinct results?
I have spent a few days trying to find an answer (exploring options like facets, aggregations, top_hits, etc.), but so far the only working option I see is a manual one: get results from Elasticsearch, group the data manually and pass it to the client. The problem with this approach is obvious: every time I grab the next portion, I have to get X extra records in case some of the records are duplicates. Since I don't know X in advance (and the number of such records could be huge), I will be forced either to fetch lots of data unnecessarily on every search or to hit our search engine several times until I get the required number of records.
So far I have been unable to achieve my goal using aggregations. All I get is a document count, but I want the actual data. When I try to use top_hits, I get data, but those really are top hits (sorted by number of offices per department, ignoring the sorting I specified in the query). Here is an example of the code I tried:
searchDescriptor = searchDescriptor.Aggregations(a => a
.Terms("myunique",
t =>
t.Field("department.departmentID")
.Size(10)
.Aggregations(
x=>x.TopHits("mytophits",
y=>y.Source(true)
.Size(1)
.Sort(k => k.OnField("department.departmentName").Ascending())
)
)
)
);
Does anyone know if Elasticsearch can perform operations like Distinct and get unique records?
Update:
I can get results using top_hits (see below), but in this case I won't be able to use paging (looks like Elasticsearch aggregations feature doesn't support paging), so I am back to square one...
{
"from": 0,
"size": 33,
"explain": false,
"sort": [
{
"departmentID": {
"order": "asc"
}
}
],
"_source": {
"include": [
"department"
]
},
"aggs": {
"myunique": {
"terms": {
"field": "department.departmentID",
"order": {
"mytopscore": "desc"
}
},
"aggs": {
"mytophits": {
"top_hits": {
"size": 5,
"_source": {
"include": [
"department.departmentID"
]
}
}
},
"mytopscore": {
"max": {
"script": "_score"
}
}
}
}
},
"query": {
"wildcard" : { "officeName" : "some office*" }
}
}

Resources