Restructuring Elasticsearch model for fast aggregations - elasticsearch

My business domain is real estate listings, and I'm trying to build a faceted UI. So I need to run aggregations to know how many listings have 1 bed, 2 beds, how many fall in a given price range, how many have a pool, and so on. Pretty standard stuff.
Currently my model is like this:
{
  "beds": 1,
  "baths": 1,
  "price": 100000,
  "features": ["pool", "aircon"],
  "inspections": [
    { "startsOn": "2019-01-20" }
  ]
}
To build my faceted UI, I'm running multiple aggregations, e.g.:
{
  "aggs": {
    "beds": {
      "terms": { "field": "beds" }
    },
    "baths": {
      "terms": { "field": "baths" }
    },
    "features": {
      "terms": { "field": "features" }
    }
  }
}
You get the idea. If I've got 10 fields, I'm running 10 aggregations.
But after seeing this article, I'm thinking I should just restructure my model like this:
{
  "beds": 1,
  "baths": 1,
  "price": 100000,
  "features": ["pool", "aircon"],
  "attributes": ["bed_1", "bath_1", "price_100000-200000", "has_pool", "has_aircon", "has_inspection_tomorrow"]
}
Then I only need one aggregation:
{
  "aggs": {
    "attributes": {
      "terms": {
        "field": "attributes"
      }
    }
  }
}
So I've got a couple of questions.
Is the only drawback of this approach that logic moves to the client? If so, I'm happy with that trade-off for performance, since I don't see this logic changing very often.
Can I leverage this field in my queries too? For example, what if I wanted to match all documents with 1 bedroom and price = 100000 and with a pool, etc.? Terms queries work as an 'any' match, but how can I find documents where the array of values contains all the provided terms?
Alternatively, if you can think of a better way to model for search speed, please let me know!
Thanks

For the second point, you can use the terms set query (doc here).
This query is like a terms query, but gives you control over how many terms must match.
You can configure it through a script like this:
GET /my-index/_search
{
  "query": {
    "terms_set": {
      "codes": {
        "terms": ["bed_1", "bath_1", "price_100000-200000"],
        "minimum_should_match_script": {
          "source": "params.num_terms"
        }
      }
    }
  }
}
This will require all of the provided terms to match.
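If the required terms are known up front, an equivalent and often simpler alternative is a bool query with one term clause per required value, since every filter clause must match. A sketch, reusing the attributes field and values from the question:

GET /my-index/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "attributes": "bed_1" } },
        { "term": { "attributes": "bath_1" } },
        { "term": { "attributes": "has_pool" } }
      ]
    }
  }
}

Filter clauses don't contribute to scoring and are cacheable, which suits exact-match facet filtering like this.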

Related

Elasticsearch 5 (Searchkick) Aggregation Bucket Averages

We have an ES index holding scores given for different products. What we're trying to do is aggregate on product names and then get the average scores for each of product name 'buckets'. Currently the default aggregation functionality only gives us the counts for each bucket - is it possible to extend this to giving us average score per product name?
We've looked at pipeline aggregations but the documentation is pretty dense and doesn't seem to quite match what we're trying to do.
Here's where we've got to:
{
  "aggs" => {
    "prods" => {
      "terms" => {
        "field" => "product_name"
      },
      "aggs" => {
        "avgscore" => {
          "avg" => {
            "field" => "score"
          }
        }
      }
    }
  }
}
Either this is wrong, or could it be that there's something in how Searchkick compiles its ES queries that is breaking things?
Thanks!
I think this is the pipeline aggregation you want...
POST /_search
{
  "size": 0,
  "aggs": {
    "product_count": {
      "terms": {
        "field": "product"
      },
      "aggs": {
        "total_score": {
          "sum": {
            "field": "score"
          }
        }
      }
    },
    "avg_score": {
      "avg_bucket": {
        "buckets_path": "product_count>total_score"
      }
    }
  }
}
Hopefully I have that the right way round; if not, switch the first two buckets.
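One point worth separating out: the nested avg aggregation in the question already yields an average per product bucket, while the avg_bucket pipeline computes a single overall value across buckets, returned at the top level of the aggregations section. A rough sketch of that response shape (the value shown is hypothetical):

{
  "aggregations": {
    "product_count": {
      "buckets": [ ... ]
    },
    "avg_score": {
      "value": 42.5
    }
  }
}

So if the goal is an average score per product name, the question's nested avg is the right tool; avg_bucket answers "what is the average across all products".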

ElasticSearch return non analyzed version of analyzed aggregate

I am having a problem implementing an autocomplete feature using the data in Elasticsearch. My documents currently have this kind of structure:
PUT mainindex/books/1
{
  "title": "The unread book",
  "author": "Mario smith",
  "tags": [ "Comedy", "Romantic", "Romantic Comedy", "México" ]
}
All the fields are indexed, and the mapping for the tags uses lowercase and asciifolding filters.
Now the required functionality is that if the user types mario smith rom..., I need to suggest tags starting with rom, but only for books by Mario Smith. This requires breaking the text into components, and I've already got that part. The current query is something like this:
{
  "query": {
    "query_string": {
      "query": "mario smith",
      "default_operator": "AND"
    }
  },
  "size": 0,
  "aggs": {
    "autocomplete": {
      "terms": {
        "field": "suggest",
        "order": {
          "_term": "asc"
        },
        "include": {
          "pattern": "rom.*"
        }
      }
    }
  }
}
This returns the expected result: a list of words the user could type next, based on the query and the prefix of the word they have started typing.
{
  "aggregations": {
    "autocomplete": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "romantic comedy",
          "doc_count": 4
        },
        {
          "key": "romantic",
          "doc_count": 2
        }
      ]
    }
  }
}
Now the problem is that I can't present these words to the user, because they are lowercased and stripped of accents; words like México got indexed as mexico, and in my language that makes some words look weird. If I remove the filters from the tag field, the values are saved into the index correctly, but the pattern rom.* will no longer match, because the user may be typing in a different case and may not use the correct accents.
In general terms, what I need is to take a filtered set of documents, aggregate their tags, return the tags in their natural format, but filter out the ones that don't share the typed prefix, matching in a case- and accent-insensitive way.
PS: I saw some suggestions about keeping two versions of the field, one analyzed and one raw, but I can't seem to filter by one and return the other.
Does anyone have an idea how to perform this query or implement this functionality?
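As a sketch of the two-versions idea mentioned in the PS (the analyzer name is hypothetical, and the exact mapping syntax varies by Elasticsearch version): a multi-field mapping can keep the raw tag for display alongside a folded sub-field for matching:

PUT mainindex
{
  "settings": {
    "analysis": {
      "analyzer": {
        "folded": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "tags": {
        "type": "keyword",
        "fields": {
          "folded": {
            "type": "text",
            "analyzer": "folded"
          }
        }
      }
    }
  }
}

Aggregating on tags then returns the original casing and accents, while queries against tags.folded match case- and accent-insensitively. Note this only covers the matching half; combining "aggregate on the raw field but filter buckets by the folded form" in one terms aggregation still requires extra work on top of this.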

How to limit ElasticSearch results by a field value?

We've got a system that indexes resume documents in ElasticSearch using the mapper attachment plugin. Alongside the indexed document, I store some basic info, like if it's tied to an applicant or employee, their name, and the ID they're assigned in the system. A query that runs might look something like this when it hits ES:
{
  "size": 100,
  "query": {
    "query_string": {
      "query": "software AND (developer OR engineer)",
      "default_field": "fileData"
    }
  },
  "_source": {
    "includes": [ "applicant.*", "employee.*" ]
  }
}
And gets me results like:
"hits": [
  {
    "_index": "careers",
    "_type": "resume",
    "_id": "AVEW8FJcqKzY6y-HB4tr",
    "_score": 0.4530588,
    "_source": {
      "applicant": {
        "name": "John Doe",
        "id": 338338
      }
    }
  },
  ...
]
What I'm trying to do is limit the results, so that if John Doe with id 338338 has three different resumes in the system that all match the query, I only get back one match, preferably the highest scoring one (though that's not as important, as long as I can find the person). I've been trying different options with filters and aggregates, but I haven't stumbled across a way to do this.
There are various approaches I can take in the app that calls ES to tackle this after I get results back, but if I can do it on the ES side, that would be preferable. Since I'm limiting the query to say, 100 results, I'd like to get back 100 individual people, rather than getting back 100 results and then finding out that 25% of them are docs tied to the same person.
What you want to do is an aggregation to get the top 100 unique records, and then a sub-aggregation asking for the "top_hits". Here is an example from my system. In my example I'm:
- setting the result size to 0, because I only care about the aggregations
- setting the size of the aggregation to 100
- for each aggregation bucket, getting the top 1 result
GET index1/type1/_search
{
  "size": 0,
  "aggs": {
    "a1": {
      "terms": {
        "field": "input.user.name",
        "size": 100
      },
      "aggs": {
        "topHits": {
          "top_hits": {
            "size": 1
          }
        }
      }
    }
  }
}
There's a simpler way to accomplish what @ckasek is looking to do, by making use of Elasticsearch's collapse functionality.
Field Collapsing, as described in the Elasticsearch docs:
Allows to collapse search results based on field values. The collapsing is done by selecting only the top sorted document per collapse key.
Based on the original query example above, you would modify it like so:
{
  "size": 100,
  "query": {
    "query_string": {
      "query": "software AND (developer OR engineer)",
      "default_field": "fileData"
    }
  },
  "collapse": {
    "field": "id"
  },
  "_source": {
    "includes": [ "applicant.*", "employee.*" ]
  }
}
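Collapsing requires a single-valued keyword or numeric field with doc_values enabled. If you also want a few more documents per person rather than only the top one, collapse supports inner_hits; a sketch, reusing the id field from the example above (the inner_hits name is arbitrary):

"collapse": {
  "field": "id",
  "inner_hits": {
    "name": "top_resumes",
    "size": 3,
    "sort": [ { "_score": { "order": "desc" } } ]
  }
}

Each hit then carries an inner_hits section with up to three resumes for that person.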
Using the answer above and the link from IanGabes, I was able to restructure my search like so:
{
  "size": 0,
  "query": {
    "query_string": {
      "query": "software AND (developer OR engineer)",
      "default_field": "fileData"
    }
  },
  "aggregations": {
    "employee": {
      "terms": {
        "field": "employee.id",
        "size": 100
      },
      "aggregations": {
        "score": {
          "max": {
            "script": "scores"
          }
        }
      }
    },
    "applicant": {
      "terms": {
        "field": "applicant.id",
        "size": 100
      },
      "aggregations": {
        "score": {
          "max": {
            "script": "scores"
          }
        }
      }
    }
  }
}
This gets me back two buckets: one containing all the applicant IDs and the highest score from the matched docs, and the same for employees. The script is nothing more than a Groovy script on the shard whose content is '_score'.
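On more recent Elasticsearch versions, the same idea can be written with a Painless script, and the buckets can be ordered by that score directly. A sketch, reusing the applicant.id field from the answer above:

{
  "size": 0,
  "aggregations": {
    "applicant": {
      "terms": {
        "field": "applicant.id",
        "size": 100,
        "order": { "score": "desc" }
      },
      "aggregations": {
        "score": {
          "max": {
            "script": { "source": "_score" }
          }
        }
      }
    }
  }
}

Ordering the terms buckets by the score sub-aggregation means the highest-scoring people come back first.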

Elasticsearch tags aggregation with specific keys

I have an array field with tags and a fixed list of the 10 most popular tags (which I got from a previous terms aggregation call).
Can I get document counts for the current search for exactly these keys (tags from my array)? Like a terms aggregation, but for specific keys only.
Thanks!
Take a look at filtering terms aggregations, especially the include parameter. It would be easier to show you if you provided a specific example of your problem, but here is the example from the docs that should help you figure out how to solve your problem:
{
  "aggs": {
    "JapaneseCars": {
      "terms": {
        "field": "make",
        "include": ["mazda", "honda"]
      }
    },
    "ActiveCarManufacturers": {
      "terms": {
        "field": "make",
        "exclude": ["rover", "jensen"]
      }
    }
  }
}
You can use the include or exclude parameters inside a terms aggregation to filter your keys.
{
  "size": 0,
  "aggs": {
    "my_agg": {
      "terms": {
        "field": "agg_field",
        "include": ["key1", "key2", "key3"]
      }
    }
  }
}

Elasticsearch grouping facet by owner, mine vs others

I am using Elasticsearch to index documents that have an owner, stored in a userId property of the source object. I can easily facet on the userId and get facets for every owner there is, but I'd like the facets for owner to show up like so:
Documents owned by me (X)
Documents owned by others (Y)
I could handle this on the client side and take all of the facets returned by elasticsearch and go through them and figure out those owned by the current user and not and display it appropriately, but I was hoping there was a way to tell elasticsearch to handle this in the query itself.
You can use filtered facets to do this:
curl -XGET "http://localhost:9200/_search" -d'
{
  "query": {
    "match_all": {}
  },
  "facets": {
    "my_docs": {
      "filter": {
        "term": { "user_id": "my_user_id" }
      }
    },
    "others_docs": {
      "filter": {
        "not": {
          "term": { "user_id": "my_user_id" }
        }
      }
    }
  }
}'
One of the nice things about this is that the two term filters are identical and so are only executed once. The not filter just inverts the result of the cached term filter.
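Note that facets were removed in Elasticsearch 2.0; on current versions the same "mine vs. others" breakdown can be expressed with filter aggregations. A sketch, reusing the user_id field from the answer above:

{
  "size": 0,
  "query": { "match_all": {} },
  "aggs": {
    "my_docs": {
      "filter": { "term": { "user_id": "my_user_id" } }
    },
    "others_docs": {
      "filter": {
        "bool": {
          "must_not": { "term": { "user_id": "my_user_id" } }
        }
      }
    }
  }
}

Each filter aggregation returns a doc_count, giving the X and Y counts for the two facet rows directly.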
You're right, Elasticsearch has a way to do that. Take a look at scripted term facets, especially the second example ("using the boolean feature"). You should be able to do something like:
{
  "query": {
    "match_all": {}
  },
  "facets": {
    "userId": {
      "terms": {
        "field": "userId",
        "size": 10,
        "script": "term == '<your user id>' ? true : false"
      }
    }
  }
}
