Return unique results in elasticsearch - elasticsearch

I have a use case in which I have data like
{
name: "John",
parentid: "1234",
filter: {a: '1', b: '3', c: '4'}
},
{
name: "Tim",
parentid: "2222",
filter: {a: '2', b: '1', c: '4'}
},
{
name: "Mary",
parentid: "1234",
filter: {a: '1', b: '3', c: '5'}
},
{
name: "Tom",
parentid: "2222",
filter: {a: '1', b: '3', c: '1'}
}
expected results:
bucket:[{
key: "2222",
hits: [{
name: "Tom" ...
},
{
name: "Tim" ...
}]
},
{
key: "1234",
hits: [{
name: "John" ...
},
{
name: "Mary" ...
}]
}]
I want to return unique documents grouped by parentid. I could use a top hits aggregation, but I don't know how to paginate the buckets. Since parentid values are more likely to differ than to repeat, my bucket array will be large, and I want to show all of the buckets, one page at a time.

There is no direct way of doing this, but you can follow these steps to get the desired result.
Step 1. You need to know all parentid values. You can get them with a simple terms aggregation on the parentid field, which returns only the list of parentid values, not the documents that match them. The resulting array will be smaller than the one you are currently expecting.
{
"aggs": {
"parentids": {
"terms": {
"field": "parentid",
"size": 0
}
}
}
}
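size: 0 is required to return all results. Note that this only works on Elasticsearch versions before 5.0; later versions reject a terms aggregation size of 0 and need an explicit positive size.
Alternatively, if you already know the full list of parentid values, you can move directly to step 2.
Step 2. Fetch the related documents by filtering on parentid; this is where you can apply pagination.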
{
"from": 0,
"size": 20,
"query": {
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"term": {
"parentid": "2222"
}
}
}
}
}
from and size are used for pagination, so you can loop through each parentid in the list and fetch all of its related documents.
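The two steps could be driven from a client roughly as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: `search` is a hypothetical callable that takes a request body and returns the parsed JSON response (for example a thin wrapper around whichever client you use), all helper names are invented for illustration, the bucket count is passed explicitly, and a bool query with a filter clause stands in for the filtered query.

```python
from typing import Any, Callable, Dict, Iterator, List

# Hypothetical signature: request body in, parsed JSON response out.
Search = Callable[[Dict[str, Any]], Dict[str, Any]]


def parent_ids_body(max_parents: int) -> Dict[str, Any]:
    """Step 1: terms aggregation listing distinct parentid values, no hits."""
    return {
        "size": 0,
        "aggs": {"parentids": {"terms": {"field": "parentid", "size": max_parents}}},
    }


def children_body(parent_id: str, page: int, per_page: int = 20) -> Dict[str, Any]:
    """Step 2: one page of documents that share a single parentid."""
    return {
        "from": page * per_page,
        "size": per_page,
        "query": {"bool": {"filter": {"term": {"parentid": parent_id}}}},
    }


def list_parent_ids(search: Search, max_parents: int = 1000) -> List[str]:
    """Run step 1 and return the bucket keys."""
    response = search(parent_ids_body(max_parents))
    return [b["key"] for b in response["aggregations"]["parentids"]["buckets"]]


def iter_children(search: Search, parent_id: str, per_page: int = 20) -> Iterator[Dict[str, Any]]:
    """Yield every hit for one parentid, requesting page after page."""
    page = 0
    while True:
        hits = search(children_body(parent_id, page, per_page))["hits"]["hits"]
        yield from hits
        if len(hits) < per_page:  # a short page means there is nothing left
            break
        page += 1
```

Keeping the body builders separate from the loop makes it easy to inspect the exact requests before sending them.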

If you are just looking for all names grouped by parent id, you can use the query below:
{
"query": {
"match_all": {}
},"aggs": {
"parent": {
"terms": {
"field": "parentid",
"size": 0
},"aggs": {
"NAME": {
"terms": {
"field": "name",
"size": 0
}
}
}
}
},"size": 0
}
If you want the entire document grouped by parentid, it will be a two-step process, as explained by Sumit above, and you can use pagination there.
An aggregation doesn't give you access to all documents or document ids in its result, so this has to be a two-step process.
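For the names-only case, the nested buckets in the response can be flattened into a simple mapping. The sketch below assumes the aggregation names used in the query above (`parent` and `NAME`) and the usual `aggregations` / `buckets` / `key` response layout; `names_by_parent` is a made-up helper and the sample is hand-written rather than real output. If name is an analyzed field, the keys will be analyzed tokens rather than the original strings.

```python
from typing import Any, Dict, List


def names_by_parent(response: Dict[str, Any]) -> Dict[str, List[str]]:
    """Map each parentid bucket key to the name keys nested under it."""
    grouped: Dict[str, List[str]] = {}
    for parent in response["aggregations"]["parent"]["buckets"]:
        grouped[parent["key"]] = [b["key"] for b in parent["NAME"]["buckets"]]
    return grouped


# Hand-written sample in the shape of a nested terms response (not real output).
sample = {
    "aggregations": {
        "parent": {
            "buckets": [
                {"key": "1234", "doc_count": 2,
                 "NAME": {"buckets": [{"key": "John", "doc_count": 1},
                                      {"key": "Mary", "doc_count": 1}]}},
                {"key": "2222", "doc_count": 2,
                 "NAME": {"buckets": [{"key": "Tim", "doc_count": 1},
                                      {"key": "Tom", "doc_count": 1}]}},
            ]
        }
    }
}

grouped = names_by_parent(sample)
```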

Related

Elasticsearch - Sort query based on collapse results

I'm trying to group/stack items based on their SKU.
Currently, when sorting from high to low, a SKU that is sold for both $10 and $1 shows its $1 item first (because it is also sold for $10, it is placed at the front of the array, of course). The sort should only take the lowest price of each specific SKU into account.
Is there a way to sort on the lowest price of every SKU and return only a single item per SKU?
If the results of the collapse could be used as a variable for the sorting, this could be solved, but I haven't been able to find out how this works.
My item object looks like this:
{
itemId: String,
sku: String,
price: Number
}
This is my query:
let itemsPerPage = 25;
let searchQuery = {
from: itemsPerPage * page,
size: itemsPerPage,
_source: ['itemId'],
sort: [{'sale.price': 'desc'}],
query: {
bool: {
must: [],
must_not: []
}
},
collapse: {
field: 'sku',
inner_hits: [{
name: 'lowest_price',
size: 1,
_source: ['itemId'],
sort: [{
'price': 'asc'
}]
}
],
}
};
You need to add sort underneath collapse.
Example:
GET /test/_search
{
"query": {
"function_score": {
"query": {
"constant_score": {
"filter": {
"bool": {
"must": [
{
"match" : {
"job_status" : "SUCCESS"
}
}
]
}
}
}
}
}
},
"collapse": {
"field": "run_id.keyword"
},
"sort": [
{
"@timestamp": {
"order": "desc"
}
}
]
}
This may solve your issue.
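To see why the top-level sort matters, here is a plain-Python imitation of collapsing, based on my understanding that Elasticsearch keeps the top-sorted document for each collapse key. It is an illustration only, not Elasticsearch code; `collapse_sorted` and the sample items are made up.

```python
from typing import Any, Dict, List


def collapse_sorted(items: List[Dict[str, Any]], field: str, sort_key: str,
                    descending: bool = False) -> List[Dict[str, Any]]:
    """Sort items, then keep only the first item seen for each value of `field`."""
    seen = set()
    kept = []
    for item in sorted(items, key=lambda i: i[sort_key], reverse=descending):
        if item[field] not in seen:
            seen.add(item[field])
            kept.append(item)
    return kept


items = [
    {"itemId": "a1", "sku": "A", "price": 10},
    {"itemId": "a2", "sku": "A", "price": 1},
    {"itemId": "b1", "sku": "B", "price": 5},
]

# Ascending: every SKU is represented and ranked by its cheapest item.
cheapest_first = collapse_sorted(items, "sku", "price")
# Descending: SKU A is ranked by its $10 item even though it also sells for $1.
priciest_first = collapse_sorted(items, "sku", "price", descending=True)
```

With the ascending sort, each SKU appears once at the position of its lowest price. With the descending sort, SKU A is ranked by its highest price, which matches the effect described in the question.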

Elasticsearch match against filter only

We have a multi-tenant index and need to perform queries against the index for a single tenant only. Basically, for all documents that match the filter, return any documents that match the following query, but do not include documents that only match the filter.
For example, say we have a list of documents like so:
{ _id: 1, account_id: 1, name: "Foo" }
{ _id: 2, account_id: 2, name: "Bar" }
{ _id: 3, account_id: 2, name: "Foo" }
I thought this query would work but it doesn't:
{
"bool": {
"filter": { "term": { "account_id": 2 } },
"should": [
{ "match": { "name": "Foo" } }
]
}
}
It returns both documents matching account_id: 2:
{ _id: 3, account_id: 2, name: "Foo", score: 1.111 }
{ _id: 2, account_id: 2, name: "Bar", score: 0.0 }
What I really want is it just to return document _id: 3, which is basically "Of all documents where account_id is equal to 2, return only the ones whose names match Foo".
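How can I accomplish this with ES 6.2? The caveat is that the number of should and must conditions is not always known, and I really want to avoid using minimum_should_match.
Try this instead: simply replace should with must. Because the bool query already has a filter clause, the should clause is optional and only contributes to the score, whereas a must clause has to match: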
{
"bool": {
"filter": { "term": { "account_id": 2 } },
"must": [
{ "match": { "name": "Foo" } }
]
}
}
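Since the number of conditions is not known in advance, the query can be assembled programmatically. A minimal sketch, assuming the field names from the question; `tenant_query` is a hypothetical helper, not part of any client library.

```python
from typing import Any, Dict, List


def tenant_query(account_id: int, required: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Bool query: tenant restriction in filter, every other condition in must."""
    return {
        "bool": {
            "filter": {"term": {"account_id": account_id}},
            "must": list(required),  # copy so the caller's list is not shared
        }
    }


query = tenant_query(2, [{"match": {"name": "Foo"}}])
```

The tenant restriction stays in filter, so it does not affect scoring, while each entry in the list becomes a required clause.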

Elastic Search get top grouped sums with additional filters (Elasticsearch version5.3)

This is my Mapping :
{
"settings" : {
"number_of_shards" : 2,
"number_of_replicas" : 1
},
"mappings" :{
"cpt_logs_mapping" : {
"properties" : {
"channel_id" : {"type":"integer","store":"yes","index":"not_analyzed"},
"playing_date" : {"type":"string","store":"yes","index":"not_analyzed"},
"country_code" : {"type":"text","store":"yes","index":"analyzed"},
"playtime_in_sec" : {"type":"integer","store":"yes","index":"not_analyzed"},
"channel_name" : {"type":"text","store":"yes","index":"analyzed"},
"device_report_tag" : {"type":"text","store":"yes","index":"analyzed"}
}
}
}
}
I want to query the index similar to the way I do using the following MySQL query :
SELECT
channel_name,
SUM(`playtime_in_sec`) as playtime_in_sec
FROM
channel_play_times_bar_chart
WHERE
country_code = 'country' AND
device_report_tag = 'device' AND
channel_name = 'channel' AND
playing_date BETWEEN 'date_range_start' AND 'date_range_end'
GROUP BY channel_id
ORDER BY SUM(`playtime_in_sec`) DESC
LIMIT 30;
So far my QueryDSL looks like this
{
"size": 0,
"aggs": {
"ch_agg": {
"terms": {
"field": "channel_id",
"size": 30 ,
"order": {
"sum_agg": "desc"
}
},
"aggs": {
"sum_agg": {
"sum": {
"field": "playtime_in_sec"
}
}
}
}
}
}
QUESTION 1
Although the query DSL above does return the top 30 channel_ids by playtime, I am unsure how to add the other filters to the search, i.e. country_code, device_report_tag and playing_date.
QUESTION 2
Another issue is that the result set contains only the channel_id and playtime fields, unlike the MySQL result set, which returns the channel_name and playtime_in_sec columns. I want to aggregate on the channel_id field, but the result set should return the corresponding channel_name of each group.
NOTE: Performance is a top priority here, as this is supposed to run behind a graph generator querying millions of docs or more.
TEST DATA
hits: [
{
_index: "cpt_logs_index",
_type: "cpt_logs_mapping",
_id: "",
_score: 1,
_source: {
ChID: 1453,
playtime_in_sec: 35,
device_report_tag: "mydev",
channel_report_tag: "Sony Six",
country_code: "SE",
@timestamp: "2017-08-11",
}
},
{
_index: "cpt_logs_index",
_type: "cpt_logs_mapping",
_id: "",
_score: 1,
_source: {
ChID: 145,
playtime_in_sec: 25,
device_report_tag: "mydev",
channel_report_tag: "Star Movies",
country_code: "US",
@timestamp: "2017-08-11",
}
},
{
_index: "cpt_logs_index",
_type: "cpt_logs_mapping",
_id: "",
_score: 1,
_source: {
ChID: 12,
playtime_in_sec: 15,
device_report_tag: "mydev",
channel_report_tag: "HBO",
country_code: "PK",
@timestamp: "2017-08-12",
}
}
]
QUESTION1:
Are you looking to add a filter/query to the example above? If so you can simply add a "query" node to the query document:
{
"size": 0,
"query":{
"bool":{
"must":[
{"terms": { "country_code": ["pk","us","se"] } },
{"range": { "@timestamp": { "gt": "2017-01-01", "lte": "2017-08-11" } } }
]
}
},
"aggs": {
"ch_agg": {
"terms": {
"field": "ChID",
"size": 30
},
"aggs":{
"ch_report_tag_agg": {
"terms":{
"field" :"channel_report_tag.keyword"
},
"aggs":{
"sum_agg":{
"sum":{
"field":"playtime_in_sec"
}
}
}
}
}
}
}
}
You can use all the normal Elasticsearch queries and filters to pre-filter your search before you start aggregating. (Regarding performance, Elasticsearch applies any filters and queries before it starts to aggregate, so any filtering you can do here will help a lot.)
Question2:
Off the top of my head, I would suggest one of two solutions (unless I'm completely misunderstanding the question):
Add aggs levels for the fields you want in the output, in the order you want to drill down. (You can nest aggs within aggs quite deeply without issues, and you get the bonus of a count on each level.)
Use the top_hits aggregation on the "lowest" level of aggs, and specify which fields you want in the output using "_source": { "include": [/fields/] }
Can you provide a few records of test data?
Also, it is useful to know which version of ElasticSearch you're running as the syntax and behaviour change a lot between major versions.
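Putting both suggestions together, the request and the response handling might look like the sketch below. It is only a sketch under assumptions: it uses the field names from the mapping in the question rather than those in the test data, `top_channels_body` and `channel_rows` are made-up helpers, and the `channel` top_hits sub-aggregation is my addition for fetching one channel_name per bucket. Term filters on analyzed text fields only match indexed tokens, so the values may need to be lowercased, and the range on playing_date assumes the stored strings sort like ISO dates.

```python
from typing import Any, Dict, List, Tuple


def top_channels_body(country: str, device: str, date_from: str, date_to: str,
                      limit: int = 30) -> Dict[str, Any]:
    """Filter first, then sum playtime per channel_id, highest sum first."""
    return {
        "size": 0,
        "query": {"bool": {"filter": [
            {"term": {"country_code": country}},
            {"term": {"device_report_tag": device}},
            {"range": {"playing_date": {"gte": date_from, "lte": date_to}}},
        ]}},
        "aggs": {"ch_agg": {
            "terms": {"field": "channel_id", "size": limit,
                      "order": {"sum_agg": "desc"}},
            "aggs": {
                "sum_agg": {"sum": {"field": "playtime_in_sec"}},
                # One representative document per bucket, only for its name.
                "channel": {"top_hits": {"size": 1, "_source": ["channel_name"]}},
            },
        }},
    }


def channel_rows(response: Dict[str, Any]) -> List[Tuple[str, float]]:
    """Turn the buckets into (channel_name, playtime_in_sec) rows."""
    rows = []
    for bucket in response["aggregations"]["ch_agg"]["buckets"]:
        name = bucket["channel"]["hits"]["hits"][0]["_source"]["channel_name"]
        rows.append((name, bucket["sum_agg"]["value"]))
    return rows
```

The rows come back in the same order as the buckets, so the ordering by summed playtime is preserved.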

Elasticsearch searching and sorting across 2 models

I have 2 models: Products and Skus, where a Product has one or more Skus, and a Sku belongs to exactly one Product. They have the following columns:
Product: id, title, content, category_id
Sku: id, product_id, price
I'd like to be able to display 48 products per page across various search and sort configurations, but I'm having trouble translating this to elasticsearch.
For example, it's not clear to me how I would search on title while sorting the relevant results by the lowest-priced Sku for each Product. I've tried a few different things, and closest has been to index everything as belonging to the Sku, then searching like so:
size: '48',
aggs: {
group_by_product: {
terms: { field: 'product_id' }
}
},
filter: {
and: [{
bool: {
must: { range: { price: { gte: 0, lte: 50 } } }
}
},{
bool: {
must: { terms: { category_id: [ 1, 2, 3, 4, 5, 6 ] } }
}
}]
},
query: {
fuzzy_like_this: {
fields: [ 'title', 'content' ],
like_text: 'Chair',
fuzziness: 1
}
}
But this gives 48 matching Skus, many of which belong to the same Product, so my pagination is off if I try to combine them after the search.
What would be the best way to handle this use case?
Update
Trying with the nested method, using the following structure:
{
size: '48',
query:
{ bool:
{ should:
{ fuzzy_like_this:
{ fields: [ 'title' ],
like_text: 'chair',
fuzziness: 1 },
},
{ must:
{ nested:
{ path: 'skus',
query:
{ bool:
{ must: { range: { price: { gte: 0, lte: 100 } } }
}
}
}
}
}
}
},
sort:
{ _score: 'asc',
'skus.price':
{ nested_path: 'skus',
nested_filter:
{ range: { 'skus.price': { gte: 0, lte: 100 } } },
order: 'asc',
mode: 'min'
}
}
}
This is likely closer, but still not sure how to format it. The above gives products ordered by price, but seems to completely disregard the search field.
Since paginating aggregation results is not possible, including the Skus inside the Product is a good approach, and I would go with nested objects, depending on the queries you need to support.
As an example query:
GET /product/test/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"title": {
"query": "whatever",
"fuzziness": 1,
"prefix_length": 3
}
}
},
{
"nested": {
"path": "skus",
"query": {
"range": {
"skus.price": {
"gte": 11,
"lte": 50
}
}
}
}
}
]
}
},
"sort": [
{
"skus.price": {
"nested_path": "skus",
"order": "asc",
"mode": "min"
}
}
]
}
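For the 48-products-per-page requirement, the same query can be wrapped in a small builder that adds from and size. This is a sketch, not a definitive implementation: `product_page_body` is a hypothetical helper, and the sample document only shows the shape I am assuming for a product with nested skus (the skus field would have to be mapped as nested).

```python
from typing import Any, Dict

# Assumed document shape: one product with its skus embedded as nested objects.
sample_product = {
    "id": 1,
    "title": "Chair",
    "category_id": 3,
    "skus": [{"id": 10, "price": 45}, {"id": 11, "price": 30}],
}


def product_page_body(text: str, min_price: float, max_price: float,
                      page: int, per_page: int = 48) -> Dict[str, Any]:
    """One page of products matching `text`, ordered by their cheapest sku."""
    price_range = {"range": {"skus.price": {"gte": min_price, "lte": max_price}}}
    return {
        "from": page * per_page,
        "size": per_page,
        "query": {"bool": {"must": [
            {"match": {"title": {"query": text, "fuzziness": 1, "prefix_length": 3}}},
            {"nested": {"path": "skus", "query": price_range}},
        ]}},
        "sort": [{"skus.price": {"nested_path": "skus", "order": "asc", "mode": "min"}}],
    }


second_page = product_page_body("chair", 0, 50, page=1)
```

Because each hit is a whole product, from and size count products rather than skus, which keeps the pagination consistent.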

How to do an SQL like "group by" an indexed field in Elastic Search?

How can I do an SQL like group by statement on a '_search' query in elastic search?
I basically need to:
1 - Filter a bunch of items using multiple filters, queries etc. Done
2 - Put these results into buckets of unique category_id. 'category_id' is currently mapped as a 'float' field of the item document type. I also need to display one of the items matching the above filters from each bucket.
3 - Paginate through these buckets
Note: Item count: 1 Million, Unique category_id count: 60,000
I would like to get all documents of type 'item' grouped by a field called category_id. In the results I would like to get a list of all unique 'category_id' values and a single item in each category (first or any item, doesn't matter) inside this group. I'd like to be able to use "from" and "size" to paginate through these results.
For example if i had data to the effect of:
id:1, category_id: 1, color:'blue',
id:2, category_id: 1, color:'red',
id:3, category_id: 1, color:'red',
id:4, category_id: 2, color:'blue',
id:5, category_id: 2, color:'red',
id:6, category_id: 3, color:'blue',
id:7, category_id: 3, color:'blue',
id:8, category_id: 3, color:'blue',
For example, I want to get all items that have the color 'red', grouped by category_id, and get back data to the effect of:
category_id: 1
{
item: { id:2, category_id: 1, color:'red'}
},
category_id: 2
{
item: { id:5, category_id: 2, color:'red'}
}
This is what I have so far, but it doesn't get the correct top hit, and I don't think it allows multiple filters and queries or is paginatable.
GET swap/item/_search
{
"size": 0,
"aggs": {
"color_filtered_items": {
"filter": {
"and": [
{
"terms": {
"color": [
"red"
]
}
}
]
},
"aggs": {
"group_by_cat_id": {
"terms": {
"field": "category_id",
"size": 10
},
"aggs": {
"items": {
"top_hits": {
"_source": {
"include": [
"name",
"id",
"category_id",
"color"
]
},
"size": 1
}
}
}
}
}
}
}
}
Hacks, workaround, changes to data storage suggestions welcome. Any help greatly appreciated.
Thank you all :)
The following should work, assuming that you don't want number-range-based aggregation for category_id.
Also, you can't paginate aggregated results, but you can control the size of each aggregation.
{
"aggs": {
"itemsAgg": {
"terms": {
"field": "items",
"size": 10
},
"aggs": {
"categoryAgg": {
"terms": {
"field": "category_id",
"size": 10
}
}
}
}
}
}
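Because the buckets themselves cannot be paged on the server, one workaround is to request enough buckets to cover the wanted page and slice them on the client. The sketch below assumes the response layout of the top_hits aggregation from the question (`color_filtered_items`, `group_by_cat_id`, `items`); both helpers are made up for illustration. The terms size in the request would need to be at least (page + 1) * per_page, so the cost grows with the page number.

```python
from typing import Any, Dict, List


def bucket_page(buckets: List[Dict[str, Any]], page: int, per_page: int) -> List[Dict[str, Any]]:
    """Client-side paging: slice one page out of an already fetched bucket list."""
    start = page * per_page
    return buckets[start:start + per_page]


def one_item_per_category(response: Dict[str, Any], page: int,
                          per_page: int) -> Dict[Any, Dict[str, Any]]:
    """Return {category_id: _source of the first top hit} for one page of buckets."""
    buckets = response["aggregations"]["color_filtered_items"]["group_by_cat_id"]["buckets"]
    return {
        bucket["key"]: bucket["items"]["hits"]["hits"][0]["_source"]
        for bucket in bucket_page(buckets, page, per_page)
    }
```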
